Elon

12/24/2020, 6:01 AM
Hi, our brokers are taking 25s (max time) to return queries, but direct server queries return instantly. I took heap dumps, pmaps, etc. and the one thing that stands out is jstack output. Looks like HelixTaskExecutor threads are all waiting on the same object. Anyone ever see this behavior?
All waiting on 0x00000007188b2318
This is on all brokers
The Grizzly server was responsive and ZK calls were coming back quickly; the only thing that stood out was all the HelixTaskExecutor threads waiting on the same object.
Looks like 1 server is in a GC loop, maybe the brokers are waiting on it? Strange thing is that server requests via Presto return instantly.
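
Editor's note: to check whether many threads in a jstack dump really wait on the same object, the dump can be grouped by monitor address. The following is a minimal sketch, assuming the usual HotSpot jstack text layout (a quoted thread name on the header line, followed by lines such as `- parking to wait for  <0x...>`). The thread names and stack frames in `SAMPLE` are made up for illustration; only the address comes from the thread above. Keep in mind that idle pool threads blocked on the same work queue all park on one condition object, so a large group here is not by itself a sign of trouble.

```python
import re
from collections import defaultdict

# Thread header line: starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Lines that name the object a thread is blocked on.
WAIT_LINE = re.compile(
    r"^\s*- (?:parking to wait for|waiting on|waiting to lock)\s+<(?P<addr>0x[0-9a-f]+)>"
)

# Hypothetical dump excerpt; thread names and frames are illustrative only.
SAMPLE = '''
"HelixTaskExecutor-message_handle_thread-1" #45 daemon prio=5 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007188b2318> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

"HelixTaskExecutor-message_handle_thread-2" #46 daemon prio=5 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007188b2318> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

"main" #1 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
'''


def group_waiters(jstack_text):
    """Map each awaited object address to the names of the threads blocked on it."""
    waiters = defaultdict(list)
    current = None  # name of the thread whose stack is being read
    for line in jstack_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        wait = WAIT_LINE.match(line)
        if wait and current is not None:
            waiters[wait.group("addr")].append(current)
    return dict(waiters)


if __name__ == "__main__":
    # Largest groups first.
    for addr, names in sorted(group_waiters(SAMPLE).items(), key=lambda kv: -len(kv[1])):
        print(addr, len(names), names)
```

Running it against a saved dump (for example, the text from `jstack <pid>`) shows at a glance how many threads share each address and which pools they belong to.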

Kishore G

12/24/2020, 6:32 AM
HelixTaskExecutor threads are not in the query path.
Can you check the broker log?

Elon

12/24/2020, 6:55 AM
Seemed fine, anything in particular?
After restarting servers latency went back to 0
And I see the same HelixTaskExecutors in waiting state - looks like it was a red herring
I did notice from the heap dump on the server that we had a lot of those DirectR buffer refs referring to mmapped segments.
What would cause servers to be responsive to direct requests (i.e. AsyncQueryRequest) and unresponsive to broker requests?

Kishore G

12/24/2020, 7:21 AM
Can you find the queries with high latency from the logs?

Elon

12/24/2020, 7:51 AM
I see that not all the servers responded, e.g. 2/3 servers responded in 25s
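
Editor's note: a log line like "2/3 servers responded in 25s" is what a scatter-gather with a deadline produces when one participant never answers: the caller waits out the full timeout even though the other servers replied almost immediately. The toy model below illustrates that effect only. It is not Pinot's broker code; the function name, server names, and delays are all hypothetical, and `time.sleep` stands in for server work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(server_delays, timeout_s):
    """Send one simulated request per server and wait up to timeout_s for all replies.

    Returns (responded, total, elapsed_seconds).
    """
    executor = ThreadPoolExecutor(max_workers=len(server_delays))
    start = time.monotonic()
    # time.sleep stands in for each server's processing time.
    futures = [executor.submit(time.sleep, delay) for delay in server_delays.values()]
    # Block until every server has answered or the deadline passes.
    done, _not_done = wait(futures, timeout=timeout_s)
    elapsed = time.monotonic() - start
    # Do not block on the straggler; the caller gets whatever arrived in time.
    executor.shutdown(wait=False)
    return len(done), len(futures), elapsed


if __name__ == "__main__":
    # Two fast servers and one that is stuck well past the deadline.
    delays = {"server1": 0.01, "server2": 0.01, "server3": 1.0}
    responded, total, elapsed = scatter_gather(delays, timeout_s=0.2)
    print(f"{responded}/{total} servers responded in {elapsed:.2f}s")
```

In this model the elapsed time tracks the timeout, not the fast servers, which is consistent with a single unhealthy server making every query that touches it look slow.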

Kishore G

12/24/2020, 8:16 AM
So the problem is on server side

Subbu Subramaniam

12/24/2020, 4:20 PM
We have recently seen Helix threads waiting for queue full in the server at LinkedIn. These things lined up with other problems, so we were not sure if this was a Helix issue or a Pinot issue. I tend to believe that we have a Pinot issue when trying to download segments from the controller, but I could be wrong. @Sidd investigated this most recently, but he is away on a holiday and may not respond until next year. We will watch out for more at our end. Meanwhile, is this reproducible? Is this an offline-only, offline+realtime, or realtime-only use case?

Elon

12/24/2020, 4:58 PM
It was for all tables on the server. Seems to happen after the server is running for about 1 week or so. I don't think it was the Helix threads, after restarting they all appear to be in WAITING state. All I see is that 1 server can do that to the entire cluster but when you submit an AsyncServerRequest via Presto the servers respond instantly.
Thanks @Subbu Subramaniam! I hope you all have a great holiday!