
Ken Krugler

02/19/2021, 11:05 PM
I ran a query designed to cause problems for the cluster (select distinctcount(<super-high cardinality column>) from table), and it did. The request timed out, even though I gave it a 100,000ms timeout, and now all queries (e.g. select * from crawldata limit 20) time out. I’ve looked at the controller/broker/sample of server logs, and don’t see any errors. In the broker log it looks like it’s getting no responses from servers:
2021/02/19 22:21:53.860 INFO [BaseBrokerRequestHandler] [jersey-server-managed-async-executor-59] requestId=41163,table=crawldata_OFFLINE,timeMs=10000,docs=0/0,entries=0/0,segments(queried/processed/matched/consuming/unavailable):0/0/0/0/0,consumingFreshnessTimeMs=0,servers=0/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);116.202.83.208_O=0,-1,0,0;168.119.147.123_O=0,-1,0,0;168.119.147.125_O=1,-1,0,0;168.119.147.124_O=1,-1,0,0;116.202.52.154_O=1,-1,0,0,query=select * from crawldata limit 20
But an example server log has this:
2021/02/19 22:21:43.864 INFO [QueryScheduler] [pqr-11] Processed requestId=41163,table=crawldata_OFFLINE,segments(queried/processed/matched/consuming)=213/1/1/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=2,resSerMs=1,totalTimeMs=3,minConsumingFreshnessMs=-1,broker=Broker_168.119.147.124_8099,numDocsScanned=20,scanInFilter=0,scanPostFilter=620,sched=fcfs
Trying to figure out which process or processes are borked because of the query, and why. Any ideas? Thanks!
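For reference, a minimal sketch of how such a query can be submitted to the broker over HTTP with an explicit 100,000 ms timeout. The broker host, the placeholder column name, and the queryOptions field are assumptions here (only the 8099 broker port appears in the logs above), and the exact way to pass timeoutMs varies across Pinot versions, so treat this as a sketch rather than the exact request that was sent.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HeavyQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker address; 8099 is the broker port that appears in the logs above.
        String brokerUrl = "http://localhost:8099/query/sql";

        // distinctcount over a hypothetical high-cardinality column, requesting a 100,000 ms
        // timeout via the timeoutMs query option (assumption: this field is supported by the
        // Pinot version in use).
        String body = "{\"sql\": \"select distinctcount(some_high_cardinality_column) from crawldata\","
                + " \"queryOptions\": \"timeoutMs=100000\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(brokerUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}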

Xiang Fu

02/19/2021, 11:22 PM
I feel the server CPU is still busy with the heavy query. @Jackie might have a better understanding of the query lifecycle when a query times out

Ken Krugler

02/19/2021, 11:25 PM
Could be, but I’d expect the pinotServer.log file to contain some kind of timeout for the subsequent query, versus what looks like a reasonable entry (e.g. numDocsScanned=20, etc.)

Xiang Fu

02/19/2021, 11:25 PM
There should be a timeout in the log, unless the query did not stop as expected

Jackie

02/19/2021, 11:29 PM
I feel it might be a transport layer (Netty) problem where the servers are trying to serialize too much data to send back

Ken Krugler

02/19/2021, 11:30 PM
I do see that there’s one missing requestId in the logs, which I think was the problematic request.

Jackie

02/19/2021, 11:31 PM
It might somehow block the Netty connection and cause the response for the second query to never be sent back

Ken Krugler

02/19/2021, 11:31 PM
For a distinctcount, I guess each server has to send back to the broker all unique values for the column, for every segment that it’s processing.

Jackie

02/19/2021, 11:31 PM
The server will combine all the values into a set, then send the set back

Ken Krugler

02/19/2021, 11:32 PM
Right, but each server then sends its set back to the broker, which has to combine them to get the final count

Jackie

02/19/2021, 11:33 PM
Yes, broker has to merge all sets
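A rough sketch of the flow described in the last few messages, not Pinot’s actual implementation: each server folds the column’s distinct values into one set across its segments, serializes that set back, and the broker merges the per-server sets and counts them. With a super-high cardinality column, those serialized sets can get very large.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCountSketch {

    // Server side: fold all segments this server processed into a single set of values.
    static Set<String> buildServerSet(List<List<String>> segmentsOnThisServer) {
        Set<String> values = new HashSet<>();
        for (List<String> segmentValues : segmentsOnThisServer) {
            values.addAll(segmentValues);
        }
        return values; // this entire set is what gets sent back to the broker
    }

    // Broker side: merge every server's set and count the distinct values.
    static long mergeAndCount(List<Set<String>> setsFromServers) {
        Set<String> merged = new HashSet<>();
        for (Set<String> serverSet : setsFromServers) {
            merged.addAll(serverSet);
        }
        return merged.size(); // the distinctcount result
    }
}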

Ken Krugler

02/19/2021, 11:37 PM
I just did a thread dump from one of the Pinot Server processes, and everything looks fine - no Pinot code running, nothing blocked.
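For reference, the usual way to take such a dump is jstack <pid> against the Pinot Server process; the same information can also be pulled from inside the JVM, roughly like this:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Include locked monitors and synchronizers so blocked/waiting threads are visible.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() includes thread state and stack frames
        }
    }
}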

Jackie

02/20/2021, 12:02 AM
Based on the log you posted, the server side processed the second query without any issue, but the broker didn't receive the response, and that's why I suspect something is broken in the transport layer. Maybe also check the broker log to see if everything looks normal

Ken Krugler

02/20/2021, 12:19 AM
Broker logs look normal, at least to me. See my broker log entry in the initial question.

Jackie

02/20/2021, 12:40 AM
I feel the problem might be within the Netty connection. Can you try restarting the broker and see if that solves the problem?
The symptom here is that the servers do get the query and send the response back, but the broker somehow does not receive the responses

Ken Krugler

02/20/2021, 12:59 AM
@Jackie Restarting the broker worked, thanks!!
But seems odd there were no errors in the broker log file. Should I file an issue about that?

Jackie

02/20/2021, 1:01 AM
Yes, please file an issue so we can track this, thanks
We rely on Netty to transport data. Maybe we hit some limitation in Netty, but Netty didn’t trigger the exception callback
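For context, the exception callback mentioned here is Netty’s exceptionCaught hook on a channel handler. A minimal, illustrative handler (not Pinot’s actual code) would look roughly like this; if this callback never fires when a response is dropped, the broker log stays silent, which matches the behavior above:

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class LoggingExceptionHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        // Pass responses along unchanged; this handler only cares about transport errors.
        ctx.fireChannelRead(msg);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Netty invokes this when an error propagates through the pipeline. If the
        // failure is swallowed before reaching here, nothing gets logged even though
        // the response was lost.
        System.err.println("Netty transport error: " + cause);
        ctx.close();
    }
}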