# troubleshooting
n
some config regarding the data nodes if it helps:
instance type: c6gd.4xlarge
vCPUs: 16
Memory (GiB): 32.0
count: 3 instances
jvm.config
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=10g
runtime.properties
# HTTP server threads
druid.server.http.numThreads=60

# Processing threads and buffers
druid.processing.buffer.sizeBytes=500MiB
## numMergeBuffers should be around numThreads/4
druid.processing.numMergeBuffers=4
## numThreads should be vCPU-1
druid.processing.numThreads=15
druid.processing.tmpDir=var/druid/processing

# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=256MiB
s
Some thoughts: What is the query? It may be simple but still use up a significant amount of resources. Are your data nodes running both MMs and Historicals? Does this coincide with ingestion workloads? If so, you'll likely need to distribute the CPUs among druid.processing.numThreads for the Historical and druid.worker.capacity for the MiddleManagers, and consider the memory footprint of each such that all Peons, the MM and the Historical fit within the 32g of the node.
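A rough sketch of how that split could look on one 16 vCPU / 32 GiB node (the Historical thread count, worker capacity and peon sizes below are illustrative assumptions, not tuned recommendations):
# Historical runtime.properties -- leave some cores for the MM and its peons
druid.processing.numThreads=11
# MiddleManager runtime.properties -- assume 4 concurrent tasks
druid.worker.capacity=4
# assume each peon gets ~1g heap and ~1g direct memory
druid.indexer.runner.javaOptsArray=["-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=1g"]
# rough memory budget on the 32 GiB node: Historical 8g heap + 10g direct = 18g,
# 4 peons x ~2g = ~8g, MM heap ~1g, remainder left for the OS and page cache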
n
yep, that is correct. we've got both middle manager and historical running on the same node .. i see, currently we have set the processing threads to vCPU-1 for historicals
but the timeouts don't always coincide with the ingestion activities ...
one thing i forgot to mention is that the qps is also quite high, peak: 300 queries per second
g
if data server CPU only goes to 50–60%, you are doing high QPS and getting some timeouts, and you have already maxed druid.processing.numThreads, then I suggest spending some time figuring out why CPU isn't at 90–100%. It sounds like your system is at/near max load (based on the timeouts); ordinarily we want CPU to be maxed when this is the case, not hovering at 50–60%. Could mean some inefficiency somewhere.
I would first try adding more data servers and seeing if that improves your situation with the timeouts. If not, then it's likely an inefficiency at the Broker. If it does improve, then it's likely an inefficiency at the data servers.
Once you know that, then you can poke deeper
n
i see ..understood. will increase the data node count to 4 and monitor. wondering if there are any other metrics / stats we should be monitoring
sharing a few which we have been monitoring currently.
also sharing the query nodes' config (2 r6g.xlarge - 4 vCPUs and 32 GiB of memory each)
## broker runtime.properties
# HTTP server settings
druid.server.http.numThreads=60
druid.server.http.maxSubqueryRows=600000

# HTTP client settings
druid.broker.http.numConnections=50
druid.broker.http.maxQueuedBytes=10MiB

# Processing threads and buffers
druid.processing.buffer.sizeBytes=500MiB
druid.processing.numMergeBuffers=6
druid.processing.tmpDir=var/druid/processing

# Query cache disabled -- push down caching and merging instead
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
and broker jvm.config
-server
-Xms18g
-Xmx18g
-XX:MaxDirectMemorySize=6g
looking good so far! thanks @Gian Merlino, @Sergio Ferragut
I would first try adding more data servers and seeing if that improves your situation with the timeouts. If not, then it's likely an inefficiency at the Broker. If it does improve, then it's likely an inefficiency at the data servers.
will continue monitoring, but i guess early signs point toward "inefficiency at the data servers" 🤔
🥹 i spoke too soon. got a spike again .. so something to do with query nodes then
bumping this up ^ @Gian Merlino / @Sergio Ferragut
s
Is there a particular query that does this? Is there a large join or high cardinality group by occurring in such queries? How many query nodes do you have? Are the http threads in the historicals high enough to accommodate all broker numConnections?
I went back and read your configs... Per the recommendation that Historical druid.server.http.numThreads be slightly higher than the sum of druid.broker.http.numConnections across the Brokers, your Historicals should have about 110 for http numThreads (2 Brokers x 50 connections, plus headroom). It is possible that your data nodes are running out of connection threads when both brokers are dealing with high concurrency.
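A sketch for the data nodes, assuming both Brokers keep druid.broker.http.numConnections=50:
# Historical runtime.properties
# 2 Brokers x 50 connections = 100, plus ~10 headroom for internal communication
druid.server.http.numThreads=110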
n
Is there a particular query that does this
not that i can tell 🤔 the most frequent one looks like a simple query with no join
select avg("ratings") from "ratings" where "user" = '' and __time >= CURRENT_TIMESTAMP - INTERVAL '1' MONTH
for the number of threads, actually we're a bit confused since there are two sets of documentation around this .. one on https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html
For Historicals, druid.server.http.numThreads should be set to a value slightly higher than the sum of druid.broker.http.numConnections across all the Brokers in the cluster.
On the Brokers, please ensure that the sum of druid.broker.http.numConnections across all the Brokers is slightly lower than the value of druid.server.http.numThreads on your Historicals and Tasks.
and on https://druid.apache.org/docs/latest/configuration/index.html the doc says to use this formula for druid.server.http.numThreads on historical, indexer and broker:
max(10, (Number of cores * 17) / 16 + 2) + 30
so since our
• historical has 16 vCPUs, it comes out to be 49
• query nodes have 4 vCPUs, it comes out to be 40
s
The historical cpu based formula is for its processing threads, the other calc is for http threads.
n
i thought for processing threads it's just vcpus-1
s
Right, for Historicals: processing threads = vCPUs - 1. HTTP threads = sum of broker numConnections + 10 or so for internal communication.
n
hi @Sergio Ferragut, to investigate further we think the culprit is a query which scans many segments (30 days with daily granularity) and returns row-level data instead of aggregated data. we've also captured more metrics which might be helpful. during the issue, query wait time, segment time and segment scan pending all spike on the historicals. also, all the failing requests are timing out on the historicals and not on the broker nodes.
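For reference, a minimal sketch of the kind of historical config that emits those query/segment metrics (assuming the logging emitter; swap in whatever emitter you actually use):
# Historical runtime.properties (sketch)
druid.monitoring.monitors=["org.apache.druid.server.metrics.HistoricalMetricsMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info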
s
Glad you found the culprit. Saturated Historicals can show this behavior because the processing threads are busy and cannot handle the simple small requests. Some implementations use Query Laning and/or Historical tiering to separate workloads in order to avoid this kind of problem. The number of segments might be the problem, so segment size optimization is also worth looking at. Also, like you mentioned, creating an aggregate table for the other use cases might make a lot of sense, since it would let you aggregate once and then query it many times, using far fewer resources overall vs. aggregating the detail across broad timeframes whenever the long-term queries are executed.
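If Query Laning is something you want to try, a minimal Broker-side sketch (the 20% cap is an assumption to tune for your workload, not a recommendation):
# Broker runtime.properties (sketch)
# cap the share of query resources that low-priority queries (flagged via the
# "priority" query context) can use, so small interactive queries keep getting served
druid.query.scheduler.laning.strategy=hilo
druid.query.scheduler.laning.maxLowPercent=20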