I got some problem with query result. I run the sa...
# troubleshooting
p
I got some problem with query result. I run the same query statement
Copy code
select count(*) from RuleLogsUAT
But got timeout & difference result per request. Here is the request log
Copy code
requestId=43,table=RuleLogsUAT,timeMs=10001,docs=210279/210377,entries=0/0,segments(queried/processed/matched/consuming/unavailable):21/18/18/2/0,consumingFreshnessTimeMs=1623319344461,servers=4/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);pinot-server-1_R=1,-1,0,0;pinot-server-2_R=2,6,250,0;pinot-server-0_R=2,6,250,0;pinot-server-0_O=1,7,230,0;pinot-server-1_O=2,11,230,0,query=select count(*) from RuleLogsUAT,offlineThreadCpuTimeNs=1548682,realtimeThreadCpuTimeNs=1908620
requestId=44,table=RuleLogsUAT,timeMs=10002,docs=44807/44811,entries=0/0,segments(queried/processed/matched/consuming/unavailable):19/17/17/3/0,consumingFreshnessTimeMs=1623319158004,servers=4/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);pinot-server-1_R=1,10,250,0;pinot-server-2_R=1,7,250,0;pinot-server-0_R=1,8,250,0;pinot-server-2_O=1,-1,0,0;pinot-server-1_O=1,11,230,0,query=select count(*) from RuleLogsUAT,offlineThreadCpuTimeNs=1211610,realtimeThreadCpuTimeNs=1412400
requestId=45,table=RuleLogsUAT,timeMs=10002,docs=73633/73731,entries=0/0,segments(queried/processed/matched/consuming/unavailable):23/20/20/3/0,consumingFreshnessTimeMs=1623319158007,servers=4/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);pinot-server-1_R=0,5,250,0;pinot-server-2_R=0,8,250,0;pinot-server-0_R=0,8,250,0;pinot-server-0_O=0,7,230,0;pinot-server-1_O=1,-1,0,0,query=select count(*) from RuleLogsUAT,offlineThreadCpuTimeNs=655059,realtimeThreadCpuTimeNs=1141471
requestId=46,table=RuleLogsUAT,timeMs=10002,docs=44807/44811,entries=0/0,segments(queried/processed/matched/consuming/unavailable):19/17/17/3/0,consumingFreshnessTimeMs=1623319158004,servers=4/5,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs);pinot-server-1_R=1,5,250,0;pinot-server-2_R=1,7,250,0;pinot-server-0_R=1,6,250,0;pinot-server-2_O=1,-1,0,0;pinot-server-1_O=1,7,230,0,query=select count(*) from RuleLogsUAT,offlineThreadCpuTimeNs=417430,realtimeThreadCpuTimeNs=1821230
n
are all your servers healthy? only 4 servers are responding
servers=4/5
, hence the difference in result. You can check whats wrong with 1 server
looks like a different server everytime. are they restarting?
p
hmm, i only have 3 server, but idk why pinot query 4/5 server 🤔
Screen Shot 2021-06-11 at 9.22.11 AM.png
n
is this a hybrid table? have you tagged the servers as both _REALTIME and as _OFFLINE?
p
yes, it’s hybrid table
hmm, how
numServersQueried
was determine ? It’s zk configs ?
n
you probably have tagged your servers as both tenant_OFFLINE and tenant_REALTIME. Can you check in ZK browser or instance configs?
if you’ve tagged it as both, then it will get counted as once for realtime and once for offline
p
yes. my servers has both tags
DefaultTenant_OFFLINE, DefaultTenant_REALTIME
in realtime_table - segments exists in 3 server, in offline table - segments exists in 2 server. That’s reason why numServersQueried = 5. That’s ok
all my server is healthy, but why 1 server responding timeout 🤔
n
It's always either server 1 or server 2, the ones that have both. Any errors in those server logs? Are those server high on memory utilization?
@Mayank ^^ what might be some possible reasons those servers cannot do even a count * ?
p
Copy code
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='RuleLogsUAT_OFFLINE', _selectExpressions=[count(*)], _aliasList=[null], _filter=created_at_days_epoch <= '18761', _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9998}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:RuleLogsUAT_OFFLINE), pinotQuery:PinotQuery(dataSource:DataSource(tableName:RuleLogsUAT_OFFLINE), selectList:[Expression(type:FUNCTION, functionCall:Function(operator:COUNT, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))]))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:LESS_THAN_OR_EQUAL, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:created_at_days_epoch)), Expression(type:LITERAL, literal:<Literal stringValue:18761>)])), queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9998}))})
I think this is errors logs
it’s happen on server-1
m
Is this production? If so, why do you have both realtime and offline on same server?
p
nope, this’s non-prod env
m
One possibility is that your servers don't have enough threads to handle the offline+realtime request.
can you send the query to just
RuleLogsUAT_OFFLINE
or
RuleLogsUAT_REALTIME
?
n
can you add 2 servers and separate the offline and realtime? do you have too many segments on each server? and whats size of each segment?
p
Copy code
can you add 2 servers and separate the offline and realtime?
I don’t think it’s the problem
Copy code
do you have too many segments on each server? and whats size of each segment?
Its’s around 14 realtime segments & 8 offfline segments, size of each offline segments is around 400~700kb
m
select count(*) from RuleLogsUAT_OFFLINE
How many cores?
p
here’s the resources using
m
Is server-1 always the issue?
p
i think so
only server-1 has
Timed out while polling results block,
logging
Copy code
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='RuleLogsUAT_REALTIME', _selectExpressions=[count(*)], _aliasList=[null], _filter=null, _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9999}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:RuleLogsUAT_REALTIME), pinotQuery:PinotQuery(dataSource:DataSource(tableName:RuleLogsUAT_REALTIME), selectList:[Expression(type:FUNCTION, functionCall:Function(operator:COUNT, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))]))], queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9999}))})
Copy code
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='RuleLogsUAT_OFFLINE', _selectExpressions=[count(*)], _aliasList=[null], _filter=null, _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9999}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:RuleLogsUAT_OFFLINE), pinotQuery:PinotQuery(dataSource:DataSource(tableName:RuleLogsUAT_OFFLINE), selectList:[Expression(type:FUNCTION, functionCall:Function(operator:COUNT, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))]))], queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9999}))})
both query in realtime_server & offline_server
m
any errors/exceptions in server-1 log?
Clearly there's some issue there.
p
it’s weird, not all of query is timeout
m
What's the replication? If > 1, then queries that don't hit server-1 probably work.
p
replication = 2
m
yeah, so then not all queries may go to server-1 replica
also, what's the overall read qps to the cluster?
are you firing queries only manually, or is there more read qps coming from somewhere
p
only me using
so qps = 1
m
can you restart server-1?
p
okie
m
Something is wrong in that server, we will need to look at the logs and metrics (may be thread dump) to see what's happening in it.
perhaps take a threaddump before you restart?
p
well, the issue switch to server-2 when i restart server-1
Copy code
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='RuleLogsUAT_REALTIME', _selectExpressions=[*], _aliasList=[null], _filter=created_at_days_epoch > '18761', _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9999}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:RuleLogsUAT_REALTIME), pinotQuery:PinotQuery(dataSource:DataSource(tableName:RuleLogsUAT_REALTIME), selectList:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:GREATER_THAN, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:created_at_days_epoch)), Expression(type:LITERAL, literal:<Literal stringValue:18761>)])), orderByList:[], limit:10, queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9999}))})
m
grep for "Driver, indexing time" in both server logs
Also paste broker log when query successful
Are you running master, or 0.7.1 release?
p
Here’s the logs server 1 & 2
Copy code
Are you running master, or 0.7.1 release?
I’m running on master
m
oh if master, then can you do a curl on the <server>:7500/appConfigs?
p
Copy code
# curl
curl: try 'curl --help' or 'curl --manual' for more information
# curl localhost:7500/appConfigs?
curl: (7) Failed to connect to localhost port 7500: Connection refused
I don’t think server expose port
n
Copy code
Is server-1 always the issue?
btw, it was not just server1. your log which you shared initially, showed both server1 and server 2 failing across the 5 queries from the start. So it could very well be because of offline realtime sharing the server. Restart is not what it made it go to server2.
p
Yes, i not just server 1. I just reinstall all cluster, keep volume mount. But the issue happen agains
Copy code
Timed out while polling results block, numBlocksMerged: 0 (query: QueryContext{_tableName='RuleLogsUAT_OFFLINE', _selectExpressions=[count(*)], _aliasList=[null], _filter=created_at_days_epoch <= '18761', _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={responseFormat=sql, groupByMode=sql, timeoutMs=9999}, _debugOptions=null, _brokerRequest=BrokerRequest(querySource:QuerySource(tableName:RuleLogsUAT_OFFLINE), pinotQuery:PinotQuery(dataSource:DataSource(tableName:RuleLogsUAT_OFFLINE), selectList:[Expression(type:FUNCTION, functionCall:Function(operator:COUNT, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))]))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:LESS_THAN_OR_EQUAL, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:created_at_days_epoch)), Expression(type:LITERAL, literal:<Literal stringValue:18761>)])), queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9999}))})
now it’s happend on server-0 😞