Jiaojiao Fu

05/25/2023, 1:32 AM
Good day all. I am using Druid on AWS EKS. All broker pods can't pass the readiness probe, which is a health check that curls the broker at IP:8082/status/health. Can someone give me some advice?
Thank you for your reply. Our cluster has been running for a very long time, and yesterday, after we added several broker pods, this issue suddenly occurred. So it may not be a configuration issue.
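For reference, the readiness check described above boils down to roughly the following; the port and path come from this thread, while the pod address is a placeholder:

    # what the readiness probe effectively runs against each broker pod
    # (<broker-pod-ip> is a placeholder; 8082 is the default broker port)
    curl -sf http://<broker-pod-ip>:8082/status/health
    # a healthy, fully started broker returns: true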

Abhishek Agarwal

05/25/2023, 4:32 AM
can you log into the pod and see what the curl command outputs? that might tell you where to look next. you can also check whether, for some reason, the broker is taking too long to start up
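A minimal sketch of the checks suggested here, assuming kubectl access to the cluster; the pod name and log wording are placeholders:

    # run the health check from inside the broker pod (<broker-pod> is a placeholder)
    kubectl exec -it <broker-pod> -- curl -s http://localhost:8082/status/health
    # "true" means the broker considers itself fully started; anything else
    # (connection refused, empty reply, "false") points at slow startup

    # rough look at how far startup has progressed (exact log wording varies by version)
    kubectl logs <broker-pod> | grep -i "lifecycle"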

Jiaojiao Fu

05/25/2023, 5:24 AM
When I log into the pod, the curl output is as follows. But after 10 minutes the output is true. So now the pod needs maybe 10 minutes to become ready, whereas before it took 2 minutes. I don't know why it takes so long.

Abhishek Agarwal

05/25/2023, 5:30 AM
I can help you troubleshoot, though it would need a bit of work on your end as well 🙂 When we see this, it's usually the result of the broker taking time to initialize its segment metadata view. So if there is a way for you to look at the metrics for segment metadata queries on the broker that you just started, you can see how much time those queries are taking. Another thing you can do is take flame graphs right after the broker starts (https://support.imply.io/hc/en-us/articles/360033747953-Profiling-Druid-queries-using-flame-graphs). You should take 5 flame graphs collected over 2 minutes. If this is a bit too much for you, you can just increase the health check time for brokers and move on. But otherwise, these are some of the things you can do to learn more and help us learn more.
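A sketch of both suggestions, assuming a logging metrics emitter is enabled on the broker and that async-profiler is available inside the pod; the linked article may describe a different profiler, and the pod name and PID below are placeholders:

    # segment-metadata query timings, if the broker logs metric events
    kubectl logs <broker-pod> | grep "query/time" | grep -i segmentMetadata

    # one way to capture a flame graph as HTML with async-profiler;
    # repeat a few times over the first couple of minutes after startup
    ./profiler.sh -d 30 -f /tmp/broker-flamegraph.html <broker-jvm-pid>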

Jiaojiao Fu

05/25/2023, 5:45 AM
Thanks very much for all your help! 🙌 In fact, I have indeed seen many metadata queries, and I will calculate how much time these queries take. Later I will use flame graphs to find the root cause. I'm not familiar with flame graphs, so I'll take some time to study them. Thanks~

Abhishek Agarwal

05/25/2023, 5:47 AM
thanks. flame graphs seem daunting but are a very useful tool for perf troubleshooting.

Jiaojiao Fu

05/25/2023, 6:09 AM
I observed the logs and found that the broker always has segmentMetadata queries running. Do I need to check the segmentMetadata queries issued right after the broker starts?
We have 4 historical pods; the first one took 10 minutes to start, and the second one took even longer to become ready. I generated a flame graph while these two pods were starting.
But in some cases, each pod takes about 10 minutes to start successfully. I saw some "segment metadata refresh failed" messages in the logs, as well as timeouts connecting to historicals. Could these block the broker startup process?

Abhishek Agarwal

05/27/2023, 9:01 AM
the broker should retry the failed queries. can you attach the broker logs? Also, is it possible for you to generate the flame graph as HTML, as outlined in the link I shared? Lastly, we wanted the broker flame graph, not the historical flame graph.
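If it helps, one simple way to capture and share the broker logs; the pod name and time window are placeholders:

    # dump recent broker logs to a file to attach
    kubectl logs <broker-pod> --since=1h > broker.log
    # quick filter for the refresh / timeout messages mentioned above
    grep -iE "refresh|timed out" broker.log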

Jiaojiao Fu

05/27/2023, 9:05 AM
I see the broker sending the segment metadata queries, and the queries usually time out. The warn log says connecting to historical-0 timed out.
I checked the historical-0 logs and CPU metrics, and they look normal.

Abhishek Agarwal

05/27/2023, 9:06 AM
the metric worth looking at is the number of busy threads on historical-0.
how many segments do you have in your cluster btw?

Jiaojiao Fu

05/27/2023, 9:20 AM
The historical overview metrics look normal, such as pod CPU/memory/IOPS. What metrics or logs should I be looking at?
We have maybe 1.18 million segments and 50 historical pods. Each pod has a 2 TB volume, with disk usage around 80%.

Abhishek Agarwal

05/27/2023, 9:40 AM
There is a metric to see how many jetty threads are busy. What's druid.server.http.numThreads set to on your historical servers?

Jiaojiao Fu

05/27/2023, 9:56 AM
It is set to 500.
Sorry, I made a mistake. The druid.server.http.numThreads we set on the historicals is 115, and during the period of timeouts, the value of Druid_Historical_Jetty_numOpenConnection is always much larger than the configured 115. After 05-24 16:00, the timeouts occur very frequently. So should we increase the config druid.server.http.numThreads? And could increasing the broker count lead to these timeouts?
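Assuming these metrics are scraped into Prometheus (the metric name below is the one quoted above; label names depend on your exporter), a quick way to spot historicals whose open HTTP connections exceed the configured server thread count:

    # historicals where open Jetty connections exceed druid.server.http.numThreads (115)
    max by (pod) (Druid_Historical_Jetty_numOpenConnection) > 115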

Abhishek Agarwal

05/28/2023, 12:20 PM
yeah. I would suggest increasing this config. There is some advice here on how to tune the number of threads on historicals: https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html#sizing-the-connection-pool-for-queries
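A rough worked example of the sizing guidance in that link; the broker count and per-broker connection setting below are illustrative, not taken from this cluster:

    # historical druid.server.http.numThreads should comfortably exceed
    #   (number of brokers) x (druid.broker.http.numConnections per broker)
    # e.g. 8 brokers x the default druid.broker.http.numConnections=20 -> 160,
    # so 115 threads on the historicals would be too low; something like:
    druid.server.http.numThreads=200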

Jiaojiao Fu

05/29/2023, 1:31 AM
Thank you so much for your guidance! I will try decreasing the number of brokers and increasing the historical thread count. I want to ask: why do the metadata queries always time out while groupBy queries seem to succeed? Is there some priority on the historicals?

Abhishek Agarwal

05/29/2023, 5:13 AM
Once the system is stable and all brokers are up, do you still see metadata queries taking longer?

Jiaojiao Fu

05/29/2023, 5:34 AM
Yes, metadata refreshes still fail after broker startup, also with historical timeouts. Log error:
org.apache.druid.java.util.common.RE: Query[c05a44d0-71d4-4fc1-a47a-640e90650fec] url[http://rt-druid-historical-26.rt-druid-historical.rt-druid.svc.cluster.local:8083/druid/v2/] timed out.
	at org.apache.druid.client.DirectDruidClient$1.checkQueryTimeout(DirectDruidClient.java:415) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.client.DirectDruidClient$1.handleChunk(DirectDruidClient.java:307) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.http.client.NettyHttpClient$1.messageReceived(NettyHttpClient.java:249) [druid-core-0.17.1.jar:0.17.1]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:119) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:485) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.handler.codec.http.HttpClientCodec.handleUpstream(HttpClientCodec.java:92) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.10.6.Final.jar:?]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.10.6.Final.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]