# troubleshooting
s
The error means the historical timed out when responding. What's the query? What's the data? The default query timeout is set by druid.server.http.defaultQueryTimeout.
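For reference, a minimal sketch of where those settings live (the values below are illustrative, not taken from any specific cluster): the same properties go in the runtime.properties of both the Broker and the Historicals, and a query can also pass its own timeout (in milliseconds) through the query context, capped by maxQueryTimeout.
    # Illustrative runtime.properties snippet; the same keys apply on Broker and Historical processes
    druid.server.http.defaultQueryTimeout=300000   # ms, used when the query sets no "timeout" context value (Druid's default)
    druid.server.http.maxQueryTimeout=600000       # ms, upper bound for any "timeout" a query passes in its context (value here is illustrative)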
j
druid.server.http.defaultQueryTimeout=60s. The query is a segmentMetadata query, and it runs too long. I want to know whether the timeout happens on the historical or on the broker.
s
The error seems to come from the historical. What are your configs for the broker and historicals?
j
broker config:
    # HTTP server settings
    ##druid.server.http.numThreads=60
    druid.server.http.defaultQueryTimeout=60000
    druid.server.http.maxQueryTimeout=60000

    # HTTP client settings
    druid.broker.http.numConnections=35
    druid.broker.http.maxQueuedBytes=50000000

    # Processing threads and buffers
    druid.processing.buffer.sizeBytes=536870912
    druid.processing.numMergeBuffers=4
    druid.processing.numThreads=13

    druid.query.groupBy.maxOnDiskStorage=104857600

    # Query cache disabled -- push down caching and merging instead
    druid.broker.cache.useCache=false
    druid.broker.cache.populateCache=false

    druid.sql.enable=true
    # We need to disable after broker started through kubectl edit for fast failed recovery
    # We need to set it true when rolling upgrade
    # Default true
    druid.sql.planner.awaitInitializationOnStart=true
historical config:
    # HTTP server threads
    druid.server.http.numThreads=115
    druid.server.http.defaultQueryTimeout=60000
    druid.server.http.maxQueryTimeout=60000

    # Processing threads and buffers
    druid.processing.buffer.sizeBytes=1073741824
    druid.processing.numThreads=30
    druid.processing.numMergeBuffers=50

    druid.query.groupBy.maxOnDiskStorage=104857600
    druid.segmentCache.numLoadingThreads=40

    # Segment storage
    druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":1900000000000}]
    druid.server.maxSize=1900000000000
    druid.segmentCache.lazyLoadOnStart=true

    # Query cache
    druid.historical.cache.useCache=true
    druid.historical.cache.populateCache=true
    druid.cache.type=caffeine
    druid.cache.sizeInBytes=10737418240
    druid.historical.cache.unCacheable=[]
There are 40 historicals and 8 brokers. I know the historical threads are not enough and should be increased. But when the error happened, only one broker was up; the others were down. And because the brokers were down, the historicals were not very busy. After some time, after the brokers retried again and again, the segment metadata refresh succeeded. I think the metadata is in the historicals' memory, so it should be returned quickly. Is the problem most likely on the broker side?
s
I see you've commented out the http threads for the broker... the recommendation is:
    druid.server.http.numThreads on the Broker should be set to a value slightly higher than druid.broker.http.numConnections on the same Broker.
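As a rough sketch of what that would look like with the numConnections value from your config (the exact margin above it is a judgment call, not a fixed number):
    # Broker: numThreads a bit above numConnections
    druid.broker.http.numConnections=35
    druid.server.http.numThreads=40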
8 brokers seems like a lot for 40 historicals; the ratio is usually 1:10, so 4 should be a good starting point. Another issue is http.numThreads on the historicals; the recommendation is:
    For Historicals, druid.server.http.numThreads should be set to a value slightly higher than the sum of druid.broker.http.numConnections across all the Brokers in the cluster.
So with 8 brokers at 35 numConnections each, the historicals and MiddleManager peons should be configured with 8*35+10 = 290; with 4 brokers that would be 4*35+10 = 150. Depending on what was going on in the historicals, it could be that they ran out of http threads and could not respond to the broker's requests.
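A sketch of that arithmetic as it would appear in the Historical runtime.properties, assuming 35 connections per Broker as configured above:
    # Historical (and MiddleManager peon): numThreads above the sum of all Broker connections
    # 8 Brokers * 35 connections + ~10 headroom = 290
    druid.server.http.numThreads=290
    # with 4 Brokers: 4 * 35 + 10 = 150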
If you haven't seen it already... the basic cluster tuning guide has a lot of great info on tuning a configuration: https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html
j
Thank you so much! I will adjust the parameters of the broker and historical nodes as you instructed. And I'm still curious about the root cause of the blocking when the broker starts: what causes the metadata refresh to fail?
s
👍 Thanks for the update, let us know how it goes