How can we make sure partitions are balanced? e.g....
# general
s
How can we make sure partitions are balanced? e.g. if I partition on a column, and some values are much more frequent than other values. which
functionName
should I use? And also how should I decide the
numPartitions
?
x
I think there is murmur function
"segmentPartitionConfig": { "columnPartitionMap": { "userId": { "functionName": "Murmur", "numPartitions": 128 } } },
Sample configs
s
Murmur is just a hash function, right? How does it address the balance issue? Pinot doc reads “Pinot currently supports
Modulo
,
Murmur
,
ByteArray
and
HashCode
hash functions.” but does not have any details.
Also, are you suggesting to use a very large
numPartitions
like 128? Isn’t this going to cause most partitions tiny and the imbalance issue worse?
x
ic, this function is not intended to make the partition balance
you can use smaller partition numbers
in your case the data may still follow the original distribution
the purpose of the function is just to render a fixed partition number for the event
since pinot segments are bounded by size
so if one segment is filled up then it will be sealed and the next segment is created.
This won’t impact consuming, but may impact the query time performance
if your query hits the large partition
s
IIUC you are saying that if segment size is small enough this imbalance may not be a big issue? What if we get unlucky and two large parts get hashed into the same partition? Is there a way to avoid that?
A related question: what is the difference between
replicasPerPartition
and
replication
in
segmentsConfig
? And what’s the guideline/tradeoff here to set a good value?
x
in that case, some partitions will just move much faster than other partitions
you will pay the query penalty that those two parts will always be queries together
and you can avoid that only by repartition.
replicasPerPartition
 and 
replication
should be same, I recommend to set it to 2 to 3 based on your deployment and query load.
👍 1