# troubleshooting
n
some config regarding the data nodes if it helps:
instance type: c6gd.4xlarge
vCPUs: 16
Memory (GiB): 32.0
count: 3 instances
jvm.config
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=10g
runtime.properties
# HTTP server threads
druid.server.http.numThreads=60

# Processing threads and buffers
druid.processing.buffer.sizeBytes=500MiB
## numMergeBuffers should be around numThreads/4
druid.processing.numMergeBuffers=4
## numThreads should be vCPU-1
druid.processing.numThreads=15
druid.processing.tmpDir=var/druid/processing

# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=256MiB
s
Some thoughts: What is the query? It may be simple but still use up a significant amount of resources. Are your data nodes running both MMs and Historicals? Does this coincide with ingestion workloads? If so, you'll likely need to distribute the CPUs among druid.processing.numThreads for the Historical and druid.worker.capacity for the MiddleManagers, and consider the memory footprint of each such that all Peons, the MM and the Historical fit within the 32g of the node.
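A rough sketch of how that split could look on one 16 vCPU / 32 GiB node (the Historical thread count, worker capacity and peon sizes below are illustrative assumptions, not tuned recommendations):
# Historical runtime.properties -- leave some cores for the MM and its peons
druid.processing.numThreads=11
# MiddleManager runtime.properties -- assume 4 concurrent tasks
druid.worker.capacity=4
# assume each peon gets ~1g heap and ~1g direct memory
druid.indexer.runner.javaOptsArray=["-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=1g"]
# rough memory budget on the 32 GiB node: Historical 8g heap + 10g direct = 18g,
# 4 peons x ~2g = ~8g, MM heap ~1g, remainder left for the OS and page cache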
n
yep, that is correct. we've got both middle manager and historical running on the same node .. i see, currently we have set the processing threads to vCPU-1 for historicals
but the timeouts don't always coincide with the ingestion activities ...
one thing i forgot to mention is that the qps is also quite high, peak: 300 queries per second
g
if data server CPU only goes to 50–60%, you are doing high QPS and getting some timeouts, and you have already maxed druid.processing.numThreads, then I suggest spending some time figuring out why CPU isn't at 90–100%. It sounds like your system is at/near max load (based on the timeouts); ordinarily we want CPU to be maxed when this is the case, not hovering at 50–60%. Could mean some inefficiency somewhere.
I would first try adding more data servers and seeing if that improves your situation with the timeouts. If not, then it's likely an inefficiency at the Broker. If it does improve, then it's likely an inefficiency at the data servers.
Once you know that, then you can poke deeper
n
i see ..understood. will increase the data node count to 4 and monitor. wondering if there are any other metrics / stats we should be monitoring
sharing a few which we have been monitoring currently.
also sharing the query nodes' config (2 r6g.xlarge - 4 vCPUs and 32 GiB of memory each)
## broker runtime.properties
# HTTP server settings
druid.server.http.numThreads=60
druid.server.http.maxSubqueryRows=600000

# HTTP client settings
druid.broker.http.numConnections=50
druid.broker.http.maxQueuedBytes=10MiB

# Processing threads and buffers
druid.processing.buffer.sizeBytes=500MiB
druid.processing.numMergeBuffers=6
druid.processing.tmpDir=var/druid/processing

# Query cache disabled -- push down caching and merging instead
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
and broker jvm.config
-server
-Xms18g
-Xmx18g
-XX:MaxDirectMemorySize=6g
looking good so far! thanks @Gian Merlino, @Sergio Ferragut
I would first try adding more data servers and seeing if that improves your situation with the timeouts. If not, then it's likely an inefficiency at the Broker. If it does improve, then it's likely an inefficiency at the data servers.
will continue monitoring, but i guess early signs point toward "inefficiency at the data servers" 🤔
🥹 i spoke too soon. got a spike again .. so something to do with query nodes then
bumping this up ^ @Gian Merlino / @Sergio Ferragut
s
Is there a particular query that does this? Is there a large join or high cardinality group by occurring in such queries? How many query nodes do you have? Are the http threads in the historicals high enough to accommodate all broker numConnections?
I went back and read your configs... Per the recommendation that Historical druid.server.http.numThreads be slightly higher than the sum of druid.broker.http.numConnections across the Brokers, your Historicals should have about 110 for http numThreads (2 Brokers x 50 connections, plus headroom). It is possible that your data nodes are running out of connection threads when both brokers are dealing with high concurrency.
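A sketch for the data nodes, assuming both Brokers keep druid.broker.http.numConnections=50:
# Historical runtime.properties
# 2 Brokers x 50 connections = 100, plus ~10 headroom for internal communication
druid.server.http.numThreads=110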
n
Is there a particular query that does this
not that i can tell 🤔 the most frequent one looks like a simple query with no join
select avg("ratings") from "ratings" where "user" = '' and __time >= CURRENT_TIMESTAMP - INTERVAL '1' MONTH
for the number of threads, actually we're a bit confused since there are two sets of documentation around this .. one on https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html
For Historicals, druid.server.http.numThreads should be set to a value slightly higher than the sum of druid.broker.http.numConnections across all the Brokers in the cluster.
On the Brokers, please ensure that the sum of druid.broker.http.numConnections across all the Brokers is slightly lower than the value of druid.server.http.numThreads on your Historicals and Tasks.
and on https://druid.apache.org/docs/latest/configuration/index.html the doc says to use this formula for druid.server.http.numThreads on historical, indexer and broker:
max(10, (Number of cores * 17) / 16 + 2) + 30
so since our
• historical has 16 vCPUs, it comes out to be 49
• query nodes have 4 vCPUs, it comes out to be 40
s
The historical cpu based formula is for its processing threads, the other calc is for http threads.
n
i thought for processing threads it's just vcpus-1
s
Right, for Historicals: processing threads = vCPUs - 1. HTTP threads = sum of broker numConnections + 10 or so for internal communication.
n
hi @Sergio Ferragut, to investigate further we think the culprit is a query which scans many segments (30 days with daily granularity) and returns row-level data instead of aggregated data. we've also captured more metrics which might be helpful. during the issue, query wait time, segment time and segment scan pending all spike on the historicals. also, all the failing requests are timing out on the historicals and not on the broker nodes.
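For reference, a minimal sketch of the kind of historical config that emits those query/segment metrics (assuming the logging emitter; swap in whatever emitter you actually use):
# Historical runtime.properties (sketch)
druid.monitoring.monitors=["org.apache.druid.server.metrics.HistoricalMetricsMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info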
s
Glad you found the culprit. Saturated Historicals can show this behavior because the processing threads are busy and cannot handle the simple small requests. Some implementations use Query Laning and/or Historical tiering to separate workloads in order to avoid this kind of problem. The number of segments might be the problem, so segment size optimization is also worth looking at. Also, like you mentioned, creating an aggregate table for the other use cases might make a lot of sense, since it would let you aggregate once and then query it many times, using far fewer resources overall vs. aggregating the detail across broad timeframes whenever the long-term queries are executed.
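If Query Laning is something you want to try, a minimal Broker-side sketch (the 20% cap is an assumption to tune for your workload, not a recommendation):
# Broker runtime.properties (sketch)
# cap the share of query resources that low-priority queries (flagged via the
# "priority" query context) can use, so small interactive queries keep getting served
druid.query.scheduler.laning.strategy=hilo
druid.query.scheduler.laning.maxLowPercent=20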