
Ken Krugler

02/19/2021, 11:05 PM
I ran a query designed to cause problems for the cluster (select distinctcount(<super-high cardinality column>) from table), and it did. The request timed out, even though I gave it a 100,000ms timeout, and now all queries (e.g. select * from crawldata limit 20) time out. I’ve looked at the controller/broker/sample of server logs, and don’t see any errors. In the broker log it looks like it’s getting no responses from servers:
2021/02/19 22:21:53.860 INFO [BaseBrokerRequestHandler] [jersey-server-managed-async-executor-59] requestId=41163,table=crawldata_OFFLINE,timeMs=10000,docs=0/0,entries=0/0,segments(queried/processed/matched/consuming/unavailable):0/0/0/0/0,consumingFreshnessTimeMs=0,servers=0/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);116.202.83.208_O=0,-1,0,0;168.119.147.123_O=0,-1,0,0;168.119.147.125_O=1,-1,0,0;168.119.147.124_O=1,-1,0,0;116.202.52.154_O=1,-1,0,0,query=select * from crawldata limit 20
But an example server log has this:
2021/02/19 22:21:43.864 INFO [QueryScheduler] [pqr-11] Processed requestId=41163,table=crawldata_OFFLINE,segments(queried/processed/matched/consuming)=213/1/1/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=2,resSerMs=1,totalTimeMs=3,minConsumingFreshnessMs=-1,broker=Broker_168.119.147.124_8099,numDocsScanned=20,scanInFilter=0,scanPostFilter=620,sched=fcfs
Trying to figure out which process or processes are borked because of the query, and why. Any ideas? Thanks!
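For reference, a minimal sketch of how such a query can be submitted to the broker over HTTP with an explicit 100,000 ms timeout. The broker host, the placeholder column name, and the queryOptions field are assumptions here (only the 8099 broker port appears in the logs above), and the exact way to pass timeoutMs varies across Pinot versions, so treat this as a sketch rather than the exact request that was sent.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HeavyQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker address; 8099 is the broker port that appears in the logs above.
        String brokerUrl = "http://localhost:8099/query/sql";

        // distinctcount over a hypothetical high-cardinality column, requesting a 100,000 ms
        // timeout via the timeoutMs query option (assumption: this field is supported by the
        // Pinot version in use).
        String body = "{\"sql\": \"select distinctcount(some_high_cardinality_column) from crawldata\","
                + " \"queryOptions\": \"timeoutMs=100000\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(brokerUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}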

Xiang Fu

02/19/2021, 11:22 PM
I feel the server CPU is still busy with the heavy query. @Jackie might have a better understanding of the query lifecycle when a query times out

Ken Krugler

02/19/2021, 11:25 PM
Could be, but I’d expect the pinotServer.log file to contain some kind of timeout for the subsequent query, versus what looks like a reasonable entry (e.g. numDocsScanned=20, etc.)

Xiang Fu

02/19/2021, 11:25 PM
There should be a timeout in the log, unless the query did not stop as expected

Jackie

02/19/2021, 11:29 PM
I feel it might be a transport layer (Netty) problem where the servers are trying to serialize too much data to send back

Ken Krugler

02/19/2021, 11:30 PM
I do see that there’s one missing requestId in the logs, which I think was the problematic request.

Jackie

02/19/2021, 11:31 PM
It might somehow block the Netty connection and cause the response for the second query to never be sent back

Ken Krugler

02/19/2021, 11:31 PM
For a distinctcount, I guess each server has to send back to the broker all unique values for the column, for every segment that it’s processing.

Jackie

02/19/2021, 11:31 PM
The server will combine all the values into a set, then send the set back

Ken Krugler

02/19/2021, 11:32 PM
Right, but each server then sends its set back to the broker, which has to combine them to get the final count

Jackie

02/19/2021, 11:33 PM
Yes, broker has to merge all sets
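A rough sketch of the flow described in the last few messages, not Pinot’s actual implementation: each server folds the column’s distinct values into one set across its segments, serializes that set back, and the broker merges the per-server sets and counts them. With a super-high cardinality column, those serialized sets can get very large.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCountSketch {

    // Server side: fold all segments this server processed into a single set of values.
    static Set<String> buildServerSet(List<List<String>> segmentsOnThisServer) {
        Set<String> values = new HashSet<>();
        for (List<String> segmentValues : segmentsOnThisServer) {
            values.addAll(segmentValues);
        }
        return values; // this entire set is what gets sent back to the broker
    }

    // Broker side: merge every server's set and count the distinct values.
    static long mergeAndCount(List<Set<String>> setsFromServers) {
        Set<String> merged = new HashSet<>();
        for (Set<String> serverSet : setsFromServers) {
            merged.addAll(serverSet);
        }
        return merged.size(); // the distinctcount result
    }
}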

Ken Krugler

02/19/2021, 11:37 PM
I just did a thread dump from one of the Pinot Server processes, and everything looks fine - no Pinot code running, nothing blocked.
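For reference, the usual way to take such a dump is jstack <pid> against the Pinot Server process; the same information can also be pulled from inside the JVM, roughly like this:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Include locked monitors and synchronizers so blocked/waiting threads are visible.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() includes thread state and stack frames
        }
    }
}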

Jackie

02/20/2021, 12:02 AM
Based on the log you posted, the server side processed the second query without any issue, but the broker didn't receive the response, and that's why I suspect something is broken in the transport layer. Maybe also check the broker log to see if everything looks normal

Ken Krugler

02/20/2021, 12:19 AM
Broker logs look normal, at least to me. See my broker log entry in the initial question.

Jackie

02/20/2021, 12:40 AM
I feel the problem might be within the Netty connection. Can you try restarting the broker and see if that solves the problem?
The symptom here is that the servers do get the query and send the response back, but the broker somehow does not receive the responses

Ken Krugler

02/20/2021, 12:59 AM
@Jackie Restarting the broker worked, thanks!!
But seems odd there were no errors in the broker log file. Should I file an issue about that?

Jackie

02/20/2021, 1:01 AM
Yes, please file an issue so we can track this, thanks
We rely on Netty to transport data. Maybe we hit some limitation in Netty, but Netty didn’t trigger the exception callback
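For context, the exception callback mentioned here is Netty’s exceptionCaught hook on a channel handler. A minimal, illustrative handler (not Pinot’s actual code) would look roughly like this; if this callback never fires when a response is dropped, the broker log stays silent, which matches the behavior above:

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class LoggingExceptionHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        // Pass responses along unchanged; this handler only cares about transport errors.
        ctx.fireChannelRead(msg);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Netty invokes this when an error propagates through the pipeline. If the
        // failure is swallowed before reaching here, nothing gets logged even though
        // the response was lost.
        System.err.println("Netty transport error: " + cause);
        ctx.close();
    }
}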