# troubleshooting
l
Hello team 👋. I have a problem that I'm trying to think through. I added some thoughts around it in my thread above, but I'll summarise under this thread. Any help appreciated.
Consider the following segment partitioning configuration:

```json
"segmentPartitionConfig": {
  "columnPartitionMap": {
    "customerId": {
      "functionName": "Murmur",
      "numPartitions": 10
    }
  }
}
```
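For reference, this config tells Pinot to bucket rows by a Murmur hash of `customerId` modulo `numPartitions`. A minimal sketch of that assignment (the hash below is a simplified stand-in, not Pinot's actual Murmur2 implementation, and the class name is hypothetical):

```java
public class PartitionSketch {
    // Stand-in for the Murmur2 hash Pinot uses; masked to keep it non-negative.
    static int hashCustomer(String customerId) {
        return customerId.hashCode() & 0x7fffffff;
    }

    // Maps a customerId to one of numPartitions buckets, as the config above does.
    static int partitionFor(String customerId, int numPartitions) {
        return hashCustomer(customerId) % numPartitions;
    }

    public static void main(String[] args) {
        for (String id : new String[]{"cust-1", "cust-2", "cust-3"}) {
            System.out.println(id + " -> partition " + partitionFor(id, 10));
        }
    }
}
```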
customerId gets logically partitioned into 10 buckets, but in my case some of those buckets are much hotter than others, because some customers produce far more data than the rest. This means that during batch ingestion the entire pipeline gets congested (we use Flink, and the Flink connector can currently only build one segment at a time). Instead of having a 1:1 relationship of Partition -> Sink, I could have multiple sinks per partition, and I am currently experimenting with the following approach. Say I have 10 partitions and 40 sinks; with this configuration I could assign 4 sinks per actual segment partition. The way to achieve this would be:
```java
var sinksPerPartition = 40 / 10; // 4
var primaryPartition = murmur2(document.getCustomerId()) % 10; // 0 - 9
var secondaryPartition = murmur2(document.getRecordId()) % sinksPerPartition; // 0 - 3

var finalPartition = primaryPartition * sinksPerPartition + secondaryPartition; // 0 - 39
```
This would assign 4 sinks per actual partition, allowing me to scale ingestion for large customers. However, there is one problem: for small customers, the number of records per segment might end up very small. First of all, would this approach even work? And even if it does, it would probably be better to change the Flink sink (https://github.com/apache/pinot/tree/master/pinot-connectors/pinot-flink-connector) to be able to build multiple segments concurrently, as this would have virtually the same benefit without the limitation of potentially ending up with very small segments for some partitions.
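The composite routing above can be sketched as a self-contained example. The `hash` function here is a simplified stand-in for Murmur2, and all class/field names are hypothetical; the key property it demonstrates is that every record of one customer lands somewhere inside that customer's 4-sink group, so each sink still only sees rows from a single primary partition:

```java
public class CompositePartitioner {
    static final int NUM_PARTITIONS = 10;
    static final int NUM_SINKS = 40;
    static final int SINKS_PER_PARTITION = NUM_SINKS / NUM_PARTITIONS; // 4

    // Simplified stand-in for the Murmur2 hash; masked to stay non-negative.
    static int hash(String value) {
        return value.hashCode() & 0x7fffffff;
    }

    // Routes a record to one of the 40 sinks. Sinks [p*4, p*4+3] all receive
    // rows from primary partition p only, so the segments they build still
    // carry a single Pinot partition each.
    static int sinkFor(String customerId, String recordId) {
        int primaryPartition = hash(customerId) % NUM_PARTITIONS;      // 0 - 9
        int secondaryPartition = hash(recordId) % SINKS_PER_PARTITION; // 0 - 3
        return primaryPartition * SINKS_PER_PARTITION + secondaryPartition;
    }

    public static void main(String[] args) {
        int primary = hash("customer-42") % NUM_PARTITIONS;
        for (int i = 0; i < 5; i++) {
            int sink = sinkFor("customer-42", "record-" + i);
            System.out.println("record-" + i + " -> sink " + sink);
            // Every record of this customer stays inside its 4-sink group.
            assert sink / SINKS_PER_PARTITION == primary;
        }
    }
}
```

Whether Pinot then treats those per-sink segments as belonging to primary partition p (rather than the 0-39 sink index) is exactly the part I'm unsure about.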
Any thoughts on this?
m
Interesting. Both approaches make sense to me.