One basic question about data preparation for ingestion into Apache Pinot #general

Ralph Debusmann

05/25/2022, 6:59 AM

One basic question about data preparation (for ingestion into Pinot) - how do you combine e.g. multiple Kafka topics into one table in Pinot so that you can query them as one - without having JOINs? Is there any way to do it without heavy upfront stream processing using e.g. Kafka Streams/ksqlDB/Flink/Materialize etc.?

Kishore G

05/25/2022, 1:07 PM

Do you want to simply union the two streams, or perform some kind of join across them?

Ralph Debusmann

05/25/2022, 1:10 PM

A union would be a good start (basically putting a bunch of Kafka topics into one table in Pinot), of course some kind of join would be even better. I'd just like to avoid having to use stream processing for this and just pull the data from various Kafka topics into Pinot and go from there 🙂

Ralph Debusmann

05/25/2022, 1:11 PM

How is this done at LinkedIn, for example?

Kishore G

05/25/2022, 1:18 PM

It’s a Samza job that does the join and writes back to Kafka.

Ralph Debusmann

05/25/2022, 2:55 PM

Thanks! And what if I don't want to add a stream processing component to my architecture - what options would you recommend?

Kishore G

05/25/2022, 3:10 PM

Depends on whether it is a join or a simple union of two topics.

Ralph Debusmann

05/25/2022, 5:02 PM

So in our case we have e.g. one topic of daily aggregated Twitter sentiments and one topic of daily aggregated copper prices (simplified example). It could be that one of the time series has a different starting point compared to the other - e.g. the Twitter sentiments would start in 2015 and the copper prices in 1990. Would it be possible to bring the data starting from 2015 together into one Pinot table with the union operation?
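
To make the question concrete, here is a minimal Python sketch of the union being asked about: rows from both series are tagged with their source and kept only from the latest starting date onward (2015 in the example). It runs on in-memory lists rather than Kafka or Pinot; the function name, the row layout, and the sample values are all hypothetical and only illustrate the shape of the result.

```python
from datetime import date


def union_from_common_start(*series):
    """Union several daily series, keeping rows from the latest start date on.

    Each series is a (name, rows) pair, where rows is a non-empty list of
    (day, value) tuples. Returns (day, source, value) rows sorted by day,
    then by source name.
    """
    # The common start is the latest of the individual first days.
    common_start = max(min(day for day, _ in rows) for _, rows in series)
    merged = [
        (day, name, value)
        for name, rows in series
        for day, value in rows
        if day >= common_start
    ]
    return sorted(merged, key=lambda row: (row[0], row[1]))


# Hypothetical sample data: sentiments start in 2015, copper prices in 1990.
sentiment = ("twitter_sentiment", [(date(2015, 1, 1), 0.4), (date(2015, 1, 2), 0.1)])
copper = ("copper_price", [(date(1990, 1, 1), 2500.0), (date(2015, 1, 1), 6300.0)])

for row in union_from_common_start(sentiment, copper):
    print(row)
```

Keeping a source column is what lets a single table hold both series and still be filtered or grouped per series at query time.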

Kishore G

05/25/2022, 5:43 PM

You can write a plug-in that is a composite consumer across multiple topics
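
As a rough illustration of the composite-consumer idea, the Python sketch below interleaves records from several sources round-robin and tags each one with the topic it came from. It is not Pinot's plug-in interface and does not talk to Kafka: plain iterables stand in for per-topic consumers, and the names `composite_consume` and `source_topic` are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def composite_consume(sources: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Round-robin over several record sources, tagging each record.

    `sources` maps a topic name to an iterable of dict records (a stand-in
    for a real per-topic consumer). Exhausted sources are dropped until
    none remain.
    """
    iterators = {topic: iter(records) for topic, records in sources.items()}
    while iterators:
        # Copy the keys so sources can be removed while looping.
        for topic in list(iterators):
            try:
                record = next(iterators[topic])
            except StopIteration:
                del iterators[topic]
                continue
            # Add the origin topic so rows stay distinguishable in one table.
            yield {**record, "source_topic": topic}


# Hypothetical sample records for two topics.
sources = {
    "sentiment": [{"day": "2015-01-01", "score": 0.4}, {"day": "2015-01-02", "score": 0.1}],
    "copper": [{"day": "2015-01-01", "price": 6300.0}],
}

for record in composite_consume(sources):
    print(record)
```

A real plug-in would also have to deal with per-topic offsets, checkpointing, and schema differences between topics, none of which this sketch attempts.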

Ralph Debusmann

05/26/2022, 12:14 AM

Cool thanks - I'll try that 😀

Kishore G

05/26/2022, 12:48 AM

Happy to help if you can share the PR or a GitHub repo.

Ralph Debusmann

06/02/2022, 11:32 AM

We're not there yet - I first have to drag parts of our team into the Kafka rabbit hole to get the tweets, Reddit posts, etc. onto Kafka, then we have to bring them together with the commodity price data...

Ralph Debusmann

06/02/2022, 11:33 AM

But it would really be super helpful if you could somehow read multiple topics into one Pinot table without having to add another moving part (=Flink, ksqlDB, Decodable, Materialize...) to the platform stack...

Ralph Debusmann

06/02/2022, 11:35 AM

Let alone having some maybe very restricted join functionality in Pinot - like Rockset does (even though I don't think they can keep the same latency + concurrency guarantees that you can)...
