# troubleshooting
s
The error means the historical timed out when responding. What's the query? What's the data? The default query timeout is set by druid.server.http.defaultQueryTimeout.
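For reference, a minimal sketch of where those settings live (the values below are illustrative, not taken from any specific cluster): the same properties go in the runtime.properties of both the Broker and the Historicals, and a query can also pass its own timeout (in milliseconds) through the query context, capped by maxQueryTimeout.
    # Illustrative runtime.properties snippet; the same keys apply on Broker and Historical processes
    druid.server.http.defaultQueryTimeout=300000   # ms, used when the query sets no "timeout" context value (Druid's default)
    druid.server.http.maxQueryTimeout=600000       # ms, upper bound for any "timeout" a query passes in its context (value here is illustrative)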
j
druid.server.http.defaultQueryTimeout=60s. The query is a segmentMetadata query, and it runs too long. I want to know whether the timeout happens on the historical or on the broker.
s
The error seems to come from the historical. What are your configs for the broker and historicals?
j
broker config:
    # HTTP server settings
    ##druid.server.http.numThreads=60
    druid.server.http.defaultQueryTimeout=60000
    druid.server.http.maxQueryTimeout=60000

    # HTTP client settings
    druid.broker.http.numConnections=35
    druid.broker.http.maxQueuedBytes=50000000

    # Processing threads and buffers
    druid.processing.buffer.sizeBytes=536870912
    druid.processing.numMergeBuffers=4
    druid.processing.numThreads=13

    druid.query.groupBy.maxOnDiskStorage=104857600

    # Query cache disabled -- push down caching and merging instead
    druid.broker.cache.useCache=false
    druid.broker.cache.populateCache=false

    druid.sql.enable=true
    # We need to disable after broker started through kubectl edit for fast failed recovery
    # We need to set it true when rolling upgrade
    # Default true
    druid.sql.planner.awaitInitializationOnStart=true
historical config:
    # HTTP server threads
    druid.server.http.numThreads=115
    druid.server.http.defaultQueryTimeout=60000
    druid.server.http.maxQueryTimeout=60000

    # Processing threads and buffers
    druid.processing.buffer.sizeBytes=1073741824
    druid.processing.numThreads=30
    druid.processing.numMergeBuffers=50

    druid.query.groupBy.maxOnDiskStorage=104857600
    druid.segmentCache.numLoadingThreads=40

    # Segment storage
    druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":1900000000000}]
    druid.server.maxSize=1900000000000
    druid.segmentCache.lazyLoadOnStart=true

    # Query cache
    druid.historical.cache.useCache=true
    druid.historical.cache.populateCache=true
    druid.cache.type=caffeine
    druid.cache.sizeInBytes=10737418240
    druid.historical.cache.unCacheable=[]
There are 40 historicals and 8 brokers. I know the historical threads are not enough and should be increased. But when the error happened, only one broker was up; the others were down. And because the brokers were down, the historicals were not very busy. After some time, after the brokers retried again and again, the segment metadata refresh succeeded. I think the metadata is in the historicals' memory, so it should be returned quickly. Is the problem most likely on the broker side?
s
I see you've commented out the http threads for the broker... the recommendation is:
    druid.server.http.numThreads on the Broker should be set to a value slightly higher than druid.broker.http.numConnections on the same Broker.
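As a rough sketch of what that would look like with the numConnections value from your config (the exact margin above it is a judgment call, not a fixed number):
    # Broker: numThreads a bit above numConnections
    druid.broker.http.numConnections=35
    druid.server.http.numThreads=40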
8 brokers seems like a lot for 40 historicals; the ratio is usually 1:10, so 4 should be a good starting point. Another issue is http.numThreads on the historicals; the recommendation is:
    For Historicals, druid.server.http.numThreads should be set to a value slightly higher than the sum of druid.broker.http.numConnections across all the Brokers in the cluster.
So with 8 brokers at 35 numConnections each, the historicals and MiddleManager peons should be configured with 8*35+10 = 290; with 4 brokers that would be 4*35+10 = 150. Depending on what was going on in the historicals, it could be that they ran out of http threads and could not respond to the broker's requests.
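A sketch of that arithmetic as it would appear in the Historical runtime.properties, assuming 35 connections per Broker as configured above:
    # Historical (and MiddleManager peon): numThreads above the sum of all Broker connections
    # 8 Brokers * 35 connections + ~10 headroom = 290
    druid.server.http.numThreads=290
    # with 4 Brokers: 4 * 35 + 10 = 150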
If you haven't seen it already... the basic cluster tuning guide has a lot of great info on tuning a configuration: https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html
j
Thank you so much! I will adjust the parameters of the broker and historical nodes as you instructed. And I'm still curious about the root cause of the blocking when the broker starts: what causes the metadata refresh to fail?
s
👍 Thanks for the update, let us know how it goes