# general
a
Hi team, can Pinot ingest just one partition of a Kafka topic that has many partitions?
m
If you mean you want to ignore all other partitions, you might be able to achieve it by having filter functions in ingestion. However, this is not a setup that I would recommend. What’s the use case?
a
Yes. I'd like to specify a partition number and have Pinot ingest just that partition and save the data to one table.
The use case is that data from different businesses is written to different Kafka partitions.
j
You can also filter on $segmentName, which by convention includes the partition number in the segment name, but I doubt it's fast enough.
m
Right, but what’s the business case where you want to do this? Note, Pinot scales ingestion by having parallel consumers (one per partition). If you just have one partition to consume from you will lose out on that.
I'd not recommend relying on the $segmentName convention, as there is no contract to keep it consistent in the future.
a
Data from multiple businesses is merged and written to one Kafka topic, and we depend on Pinot to separate it into different tables. As a result, the topic is consumed repeatedly, which is resource-intensive.
Is there any method to ingest one Kafka topic and write data into different Pinot tables?
m
What is the total data size across all partitions (per day)?
a
That’s a key point. Maybe 0.15 billion per day.
m
0.15 B rows?
a
yes
m
You can just have a single table partitioned by org. It will be much simpler.
You can configure the same partition function in Pinot and Kafka, and Pinot will only look at the relevant partition for a query, so it scales.
It also avoids the operational overhead of multiple tables.
150M rows per day is not a big deal
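To make the single-table suggestion above concrete, here is a minimal sketch of the table config fragment that declares a partition column and enables partition-based segment pruning. The column name `org` comes from the discussion; the partition count is a hypothetical value and would need to match the Kafka topic, and the Kafka producer would need to use a partitioner compatible with the chosen function (Murmur is assumed here).

```python
import json

# Hypothetical value: should equal the number of partitions in the Kafka topic.
NUM_PARTITIONS = 8

# Fragment of a Pinot table config (not a complete config).
# Assumes the producer partitions the topic on "org" with a Murmur-compatible partitioner.
partitioned_table_fragment = {
    "tableIndexConfig": {
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                "org": {"functionName": "Murmur", "numPartitions": NUM_PARTITIONS}
            }
        }
    },
    # Lets the broker skip segments whose partition cannot match the query filter.
    "routing": {"segmentPrunerTypes": ["partition"]},
}

print(json.dumps(partitioned_table_fragment, indent=2))
```

With this in place, a query that filters on `org` should only need to touch segments from the matching partition.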
a
I'm afraid one table won't work. Data from different businesses has different fields.
m
You can create multiple tables and give each one a filter function in ingestion so that each table only ingests its own subset of the data; that is definitely possible. But like I mentioned, it is not the most ideal or cost-effective solution, since every table still consumes the whole topic. Is there an option to repartition upstream?
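As a rough illustration of the multi-table approach, the sketch below generates one `ingestionConfig` fragment per table. It assumes each record carries a field identifying its business; the field name `business` and the table values are hypothetical. It also assumes Groovy functions are enabled on the cluster. Note that Pinot drops a record when the filter function evaluates to true, so the expression matches everything that does not belong to the table.

```python
import json

def ingestion_config_for(business: str, field: str = "business") -> dict:
    """Build an ingestionConfig fragment that keeps only one business's records.

    `field` is a hypothetical column name; adjust it to the real schema.
    """
    # Records for which the expression is true are filtered OUT,
    # so we match every record that is not for this business.
    expression = f'Groovy({{{field} != "{business}"}}, {field})'
    return {"ingestionConfig": {"filterConfig": {"filterFunction": expression}}}

# Hypothetical business names, one Pinot table each.
for name in ["payments", "ads"]:
    print(name, json.dumps(ingestion_config_for(name)))
```

Every table built this way still reads the entire topic, which is the cost concern raised in this thread.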
a
I see. "You can create multiple tables and have filter function in ingestion that each table only ingests one transform, that is definitely possible." Like you mentioned, that's what we want to do, and the cost is why I asked the question here.
About repartition upstream, what’s your suggestion?😇
m
You could have a stream processing job upstream that takes your Kafka stream and splits it into multiple topics (one per org), with partitions within each topic.
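The routing logic of such a splitter can be sketched in plain Python, without a Kafka client, to show the idea. Everything here is an assumption for illustration: the `org` and `user_id` fields, the `events.` topic prefix, and the CRC32-based partition choice. A real job would more likely be written with a stream processor such as Kafka Streams or Flink.

```python
import zlib
from collections import defaultdict

def target_topic(record: dict, prefix: str = "events") -> str:
    # One output topic per org, e.g. "events.payments" (naming is hypothetical).
    return f"{prefix}.{record['org']}"

def target_partition(record: dict, num_partitions: int) -> int:
    # Stable hash of a hypothetical key field, so one key always lands on one partition.
    key = str(record["user_id"]).encode("utf-8")
    return zlib.crc32(key) % num_partitions

def split_stream(records, num_partitions: int = 4):
    # topic -> partition -> list of records; stands in for producing to Kafka.
    routed = defaultdict(lambda: defaultdict(list))
    for record in records:
        topic = target_topic(record)
        partition = target_partition(record, num_partitions)
        routed[topic][partition].append(record)
    return routed

sample = [
    {"org": "payments", "user_id": 1},
    {"org": "ads", "user_id": 2},
    {"org": "payments", "user_id": 1},
]
routed = split_stream(sample)
print(sorted(routed))  # ['events.ads', 'events.payments']
```

Each Pinot table can then consume only its own topic, and ingestion within a table still scales with that topic's partition count.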
a
Thanks, Mayank. I like your idea and it makes things much easier.
k
Btw, we made some changes when we added support for Kinesis. Those changes will actually make it easier to add support for consuming a subset of partitions.
I can point you to the code if you want to contribute this feature
a
Sure, I’d like to give it a try.