# troubleshooting
x
hello, i could use some help with query tuning:
# schema:
user(int) | location(int) | time(long)

# 1st query (filter):
"select user, count(*) from {table} where time between {start} and {end} and location between 500 and 550 group by user having count(user) >= 24 limit 1000000"

# 2nd query (combiner):
"select time, count(distinct(user)) as count from {table} where user in ({users}) and time between {start} and {end} and location between 300 and 350 group by time limit 10000000"
the query time scales linearly with the number of selected userids from the first query
segment size = 300mb, number of records = ~500 million per day (10 segments). using the python client takes around:
query took 1.5208384820143692
found 132164 userids
query took 1.2651426389929838
using the java client, with roaring bitmaps to serialize userids from the first query takes longer:
query took 1.601
found 132164 userids
query took 2.586
both clients were measured solely on the execution time of the pinot query
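for reference, the python side is roughly this - a rough sketch using the pinotdb client, with placeholder host/port/table/time range, and the timing is just perf_counter around execute:

import time
from pinotdb import connect

# placeholder connection details - the real broker host/port differ
conn = connect(host="pinot-broker", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# 1st query (filter): users with >= 24 records in location 500-550
t0 = time.perf_counter()
curs.execute(
    "select user, count(*) from mytable "
    "where time between 0 and 86400 and location between 500 and 550 "
    "group by user having count(user) >= 24 limit 1000000"
)
users = [row[0] for row in curs]
print(f"query took {time.perf_counter() - t0}")
print(f"found {len(users)} userids")

# 2nd query (combiner): distinct users per time bucket in location 300-350
in_list = ",".join(str(u) for u in users)
t0 = time.perf_counter()
curs.execute(
    "select time, count(distinct(user)) as count from mytable "
    f"where user in ({in_list}) and time between 0 and 86400 "
    "and location between 300 and 350 group by time limit 10000000"
)
print(f"query took {time.perf_counter() - t0}")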
i don’t use the in_subquery or in_id_set construct from https://docs.google.com/document/d/1s6DZ9eTPqH7vaKQlPjKiWb_OBC3hkkEGICIzcd5gozc/edit# because i was advised by @Jackie to build the idset client side due to the nature of userid filtering in my first query
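the java client’s idset building is conceptually just this - sketched with pyroaring purely for illustration; the real client uses org.roaringbitmap, and i haven’t verified that this byte layout matches pinot’s IdSet format:

import base64
from pyroaring import BitMap

users = [101, 202, 303]  # placeholder - userids returned by the 1st query
bitmap = BitMap(users)   # roaring bitmap of the userids
id_set_b64 = base64.b64encode(bitmap.serialize()).decode("ascii")
print(f"{len(id_set_b64)} chars")  # far shorter than joining the raw ints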
broker logs:
timeMs=1159,
docs=5175479/449912692,
entries=456484558/5175479,
segments(queried/processed/matched/consuming/unavailable):10/10/10/0/0,consumingFreshnessTimeMs=0,
servers=10/10,
groupLimitReached=false,
brokerReduceTimeMs=516,
exceptions=0,
serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);
pinot-server-3_O=0,592,382743,0,-1;
pinot-server-5_O=0,580,373923,0,-1;
pinot-server-1_O=0,573,372135,0,-1;
pinot-server-7_O=0,583,378483,0,-1;
pinot-server-0_O=0,579,374595,0,-1;
pinot-server-6_O=0,640,379359,0,-1;
pinot-server-9_O=0,589,386235,0,-1;
pinot-server-4_O=0,517,381771,0,-1;
pinot-server-2_O=0,542,383007,0,1;
pinot-server-8_O=0,590,379035,0,-1,
offlineThreadCpuTimeNs=5758172708,
realtimeThreadCpuTimeNs=0
m
What’s the latency without the having clause?
Also, what indexing do you have right now?
x
i don’t see anything unusual on the grafana metrics
• pinot cluster is deployed on k8s
• server pods: 4000m cpu, 10GB mem
• broker pods: 1000m cpu, 1GB mem
• requests and limits are the same for both pod types, following the defaults in https://docs.pinot.apache.org/operators/tutorials/deployment-pinot-on-kubernetes
indexing:
column   | dictionary | forward-index | inverted-index
user     | Y          | Y             | Y
location | Y          | Y             | Y
time     | Y          | Y             | Y
segments were generated from parquet files that are physically sorted by user, then time
i don’t know if star tree index will help here, since the doc lists DISTINCT_COUNT under unsupported functions (“intermediate result set is unbounded”)
j
Broker reduce takes quite long
x
What’s the latency without the having clause?
checking now
j
One thing that can potentially help is to add an order by on count for the first query and reduce the limit
With such high limit, broker might run into heavy GCs
x
latency without HAVING clause is just as bad, ~2s with brokerReduceTimeMs=~700
query time can also vary between 1.5s and 3s for either query 1 or query 2
when you say “add an order by on count”, do you mean this?
# 1st query (old):
"select user, count(*) from {table} where time between {start} and {end} and location between 500 and 550 group by user having count(user) >= 24 limit 1000000"

# 1st query (new):
"select user, count(*) as count from {table} where time between {start} and {end} and location between 500 and 550 group by user having count(user) >= 24 order by count limit 1000000"
i’ve tried that, and also reducing the limit (to a point where i don’t get truncated results). still slow 😕
cardinality of the items in schema:
users = ~5-6 million
time = 96
cells (location) = ~2500
here are other types of queries that scale linearly with the amount of users fetched from the first step:
# get relevant users
...

# step 2 (reducer type 2)
"select location, count(user) as count from {table} where user in ({users}) and time between {start} and {end} group by location limit 10000000"

# step 2 (reducer type 3)
"select count(user) from {table} where user in ({users}) and time_rounded between {start} and {end} and location between 300 and 505 group by user limit 10000000"

# step 2 (reducer type 4)
"select location, count(distinct(user)) from {table} where user in ({users}) and time_rounded between {start} and {end} and location between 300 and 310 group by location limit 10000000"
would making segments smaller, so that the computation can be further parallelized, help? they’re currently at 300mb untarred/uncompressed per segment. or would throwing more resources at the broker help?
hi @Mayank @Jackie @Xiang Fu, any suggestions based on the info that i’ve provided?
• nature of the queries
• broker/servers resources
• broker/servers logs
• data volume
• schema and cardinality
• indexes
• using java client with roaring bitmap idset / python client
what else should i be looking at?
m
What does this 1000cpu mean - “broker pods: 1000cpu, 1GB mem”?
x
sorry, it should be 1000m, not 1000cpu - edited the original
m
Still don't know how to read it, is it 1 core?
x
ah yes. 1000m = 1 virtual cpu in k8s notation
m
Also, in terms of your query, are you trying to intersect user sets that were between location A and B? If so, and if you are ok with approximations, then take a look at this video on Theta-sketches:

https://www.youtube.com/watch?v=HlVZjUCBabk&t=2652s
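something along these lines (writing this from memory off the theta sketch docs, so double check the exact argument format - note it also drops the having count(user) >= 24 part of your first query):
"select distinctCountThetaSketch(user, 'nominalEntries=4096', 'location BETWEEN 500 AND 550', 'location BETWEEN 300 AND 350', 'SET_INTERSECT($1, $2)') as users_in_both from {table} where time between {start} and {end}"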

x
unfortunately i need to have the exact number. precision is important here
i just had another thought - is parsing an extremely long query string a cause for the slowness? e.g. if my 1st query yields 100k userids, putting these 100k userids in the 2nd query’s IN clause results in a sql string length of:
• python client, just appending ints together: 1,299,260 chars
• java client, using roaring bitmap as a base64 string: 353,755 chars
m
brokerReduceTimeMs=516,
-> This is the reduce phase in broker that happens after getting results from servers, so outside of parsing
If you want to improve broker perf, you need to give it more CPU and memory. The reduce phase can happen in parallel if there are multiple servers' responses available to reduce on, and the threads can get CPU
x
i can try that, although i see that CPU utilization for both brokers and servers is below 10%
m
This is a CPU intensive query, so such low CPU utilization implies that you are memory/io bound. Another possibility is that there’s CPU contention, but that’s not obvious to me at this time.