# troubleshooting
l
Hello team 👋. I have a problem that I'm trying to think through. I added some thoughts around it in my thread above, but I'll summarise under this thread. Any help appreciated.
Consider the following segment partitioning configuration:

```json
"segmentPartitionConfig": {
  "columnPartitionMap": {
    "customerId": {
      "functionName": "Murmur",
      "numPartitions": 10
    }
  }
}
```
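For reference, this config tells Pinot to bucket rows by a Murmur hash of `customerId` modulo `numPartitions`. A minimal sketch of that assignment (the hash below is a simplified stand-in, not Pinot's actual Murmur2 implementation, and the class name is hypothetical):

```java
public class PartitionSketch {
    // Stand-in for the Murmur2 hash Pinot uses; masked to keep it non-negative.
    static int hashCustomer(String customerId) {
        return customerId.hashCode() & 0x7fffffff;
    }

    // Maps a customerId to one of numPartitions buckets, as the config above does.
    static int partitionFor(String customerId, int numPartitions) {
        return hashCustomer(customerId) % numPartitions;
    }

    public static void main(String[] args) {
        for (String id : new String[]{"cust-1", "cust-2", "cust-3"}) {
            System.out.println(id + " -> partition " + partitionFor(id, 10));
        }
    }
}
```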
customerId gets logically partitioned into 10 buckets, but in my case some of those buckets are much hotter than others, because some customers produce far more data than the rest. This means that during batch ingestion the entire pipeline gets congested (we use Flink, and the Flink connector can currently only build one segment at a time). Instead of having a 1:1 relationship of Partition -> Sink, I could have multiple sinks per partition, and I am currently experimenting with the following approach. Say I have 10 partitions and 40 sinks; with this configuration I could assign 4 sinks per actual segment partition. The way to achieve this would be:
```java
var sinksPerPartition = 40 / 10; // 4
var primaryPartition = murmur2(document.getCustomerId()) % 10; // 0 - 9
var secondaryPartition = murmur2(document.getRecordId()) % sinksPerPartition; // 0 - 3

var finalPartition = primaryPartition * sinksPerPartition + secondaryPartition; // 0 - 39
```
This would assign 4 sinks per actual partition, allowing me to scale ingestion for large customers. However, there is one problem: for small customers, the number of records per segment might end up very small. First of all, would this approach even work? And even if it does, it would probably be better to change the Flink sink (https://github.com/apache/pinot/tree/master/pinot-connectors/pinot-flink-connector) to be able to build multiple segments concurrently, as this would have virtually the same benefit without the limitation of potentially ending up with very small segments for some partitions.
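The composite routing above can be sketched as a self-contained example. The `hash` function here is a simplified stand-in for Murmur2, and all class/field names are hypothetical; the key property it demonstrates is that every record of one customer lands somewhere inside that customer's 4-sink group, so each sink still only sees rows from a single primary partition:

```java
public class CompositePartitioner {
    static final int NUM_PARTITIONS = 10;
    static final int NUM_SINKS = 40;
    static final int SINKS_PER_PARTITION = NUM_SINKS / NUM_PARTITIONS; // 4

    // Simplified stand-in for the Murmur2 hash; masked to stay non-negative.
    static int hash(String value) {
        return value.hashCode() & 0x7fffffff;
    }

    // Routes a record to one of the 40 sinks. Sinks [p*4, p*4+3] all receive
    // rows from primary partition p only, so the segments they build still
    // carry a single Pinot partition each.
    static int sinkFor(String customerId, String recordId) {
        int primaryPartition = hash(customerId) % NUM_PARTITIONS;      // 0 - 9
        int secondaryPartition = hash(recordId) % SINKS_PER_PARTITION; // 0 - 3
        return primaryPartition * SINKS_PER_PARTITION + secondaryPartition;
    }

    public static void main(String[] args) {
        int primary = hash("customer-42") % NUM_PARTITIONS;
        for (int i = 0; i < 5; i++) {
            int sink = sinkFor("customer-42", "record-" + i);
            System.out.println("record-" + i + " -> sink " + sink);
            // Every record of this customer stays inside its 4-sink group.
            assert sink / SINKS_PER_PARTITION == primary;
        }
    }
}
```

Whether Pinot then treats those per-sink segments as belonging to primary partition p (rather than the 0-39 sink index) is exactly the part I'm unsure about.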
Any thoughts on this?
m
Interesting. Both approaches make sense to me.