# troubleshooting
a
Hi, my team is arguing about whether the Pinot table partition config depends on the Kafka topic partitioning policy. We're using the low-level consumer type to consume Kafka stream data. According to one doc (the first picture), it doesn't depend on the Kafka partitioning policy. But according to another doc (the second picture), it does. Could somebody help clear this up?
k
It does depend on the Kafka partition function, which is typically murmur hash.
This applies to real-time stream ingestion.
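To make "depends on the Kafka partition function" concrete, here is a minimal Python sketch of the murmur2 hash that Kafka's default partitioner applies to the serialized key bytes. The helper name `partition_for_key` and the UTF-8 encoding of string keys are assumptions for illustration; a producer with a custom partitioner or a different key serializer would place records differently.

```python
def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2 with Kafka's seed, returned as an unsigned int."""
    mask = 0xFFFFFFFF
    m = 0x5BD1E995
    r = 24
    length = len(data)
    h = (0x9747B28C ^ length) & mask

    # Mix the input four bytes at a time (little-endian words).
    body_end = length - (length % 4)
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k

    # Fold in the remaining 1 to 3 bytes (mirrors the Java switch fall-through).
    rem = length % 4
    if rem == 3:
        h ^= data[body_end + 2] << 16
    if rem >= 2:
        h ^= data[body_end + 1] << 8
    if rem >= 1:
        h ^= data[body_end]
        h = (h * m) & mask

    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for_key(key: str, num_partitions: int) -> int:
    """Hypothetical helper: partition id for a string key.

    Assumes the key is serialized as UTF-8 and that the sign bit is cleared
    before the modulo, as Kafka's default partitioner does.
    """
    return (murmur2(key.encode("utf-8")) & 0x7FFFFFFF) % num_partitions
```

If the partition column in Pinot is hashed the same way with the same partition count, each Kafka partition holds exactly one Pinot partition's worth of values, which is what lets the broker prune segments by partition.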
a
I'm using RealtimeToOfflineSegmentsTask in my realtime table config. Is there any suggestion on configuring the offline table, such as replica-group instance assignment or another method, that could improve query performance?
k
What's the QPS?
a
Currently 100, but we expect it to go much higher.
k
Then partitioning plus sorting within each segment will definitely help.
Replica groups can be added dynamically later on.
m
+1
a
I'm not very clear on what 'partitioning plus sorting' means. 😅
m
So let's say you have a dimension in your schema which appears in all/most queries, like `where userId = xxx`; then it is likely a good candidate to sort/partition your data on.
For realtime: You just set the `sortedColumn` config in the table config. You also need to set the partition function upstream (e.g. Kafka) and in Pinot to matching implementations (murmur2).
For offline: You currently need to sort the data outside of Pinot (minion or otherwise). The partition function has to match as well.
Do you have such a dimension?
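As a rough illustration of the realtime settings described above, the sketch below builds the relevant table-config fragment in Python and renders it as JSON. The column name `userId` comes from the example query; the partition count of 8 is an assumption and would have to equal the Kafka topic's partition count, and the rest of the table config (table name, stream configs, tenants) is omitted.

```python
import json

# Assumption: the Kafka topic has 8 partitions keyed on userId.
NUM_PARTITIONS = 8

table_config_fragment = {
    "tableIndexConfig": {
        # Sort rows within each segment on the column used in most filters.
        "sortedColumn": ["userId"],
        # Declare how the column is partitioned so segments record their partition ids.
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                "userId": {
                    "functionName": "Murmur",
                    "numPartitions": NUM_PARTITIONS,
                }
            }
        },
    },
    "routing": {
        # Let the broker skip segments whose partition cannot match the filter.
        "segmentPrunerTypes": ["partition"],
    },
}


def render_fragment() -> str:
    """Return the fragment as pretty-printed JSON for pasting into a table config."""
    return json.dumps(table_config_fragment, indent=2)
```

Sorting alone still helps when the upstream topic is not keyed on this column; the partition section only pays off once the producer's partitioning matches it.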
a
Yes, we have such a dimension, userId, and it's used in most queries.
m
Yes, then that is the one to sort and partition on.
a
I think I can use the sorted index in my case. For some reason, the upstream partition function is not based on this dimension.
m
Partitioning definitely helps if you want to serve thousands of read QPS.
What's the read QPS?
a
Pretty low, 100 currently. We'd like to improve query performance enough to support a higher QPS.
m
What's the final target QPS and the p99 latency you need?
a
Not sure yet. I'm just trying to figure out how to tune query performance with limited resources, and what I could do to achieve that. 🤣
m
Ok.