One basic question about data preparation (for ing...
# general
r
One basic question about data preparation (for ingestion into Pinot) - how do you combine e.g. multiple Kafka topics into one table in Pinot so that you can query them as one - without having JOINs? Is there any way to do it without heavy upfront stream processing using e.g. Kafka Streams/ksqlDB/Flink/Materialize etc.?
k
Do you want to simply union the two streams, or perform some kind of join across them?
r
A union would be a good start (basically putting a bunch of Kafka topics into one table in Pinot), of course some kind of join would be even better. I'd just like to avoid having to use stream processing for this and just pull the data from various Kafka topics into Pinot and go from there 🙂
How is this done in LinkedIn for example?
k
It's a Samza job that does the join and writes the result back to Kafka
➕ 1
r
Thanks! And what if I don't want to add a stream processing component to my architecture - what options would you recommend?
k
Depends on whether it is a join or a simple union of the two topics
r
So in our case we have e.g. one topic of daily aggregated Twitter sentiments and one topic of daily aggregated copper prices (simplified example). It could be that one of the time series has a different starting point compared to the other - e.g. the Twitter sentiments would start in 2015 and the copper prices in 1990. Would it be possible to bring the data starting from 2015 together into one Pinot table with the union operation?
k
You can write a plug-in that is a composite consumer across multiple topics
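To make the composite-consumer idea concrete, here is a minimal sketch of the union logic only. It does not use Pinot's actual stream plug-in interfaces or a real Kafka client: the `CompositeConsumer` class, the in-memory "topics", and the `source_topic` field are all hypothetical names chosen for illustration. It shows one consumer that polls several topics round-robin, tags each row with its origin, and keeps one offset per topic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompositeConsumer:
    """Round-robin union over several topics.

    Assumption: each topic is modeled as an in-memory list of dict rows;
    a real plug-in would wrap one Kafka consumer per topic instead.
    """
    topics: Dict[str, List[dict]]
    offsets: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Start every topic at offset 0 unless a checkpoint was supplied.
        for name in self.topics:
            self.offsets.setdefault(name, 0)

    def poll(self, max_records: int) -> List[dict]:
        """Return up to max_records rows, taking one row per topic per pass."""
        batch: List[dict] = []
        progressed = True
        while len(batch) < max_records and progressed:
            progressed = False
            for name, records in self.topics.items():
                if len(batch) >= max_records:
                    break
                pos = self.offsets[name]
                if pos < len(records):
                    row = dict(records[pos])      # copy so the source stays untouched
                    row["source_topic"] = name    # hypothetical column marking the origin
                    batch.append(row)
                    self.offsets[name] = pos + 1  # composite offset: one entry per topic
                    progressed = True
        return batch
```

The part a real plug-in has to get right is the composite offset: the checkpoint has to record a position for every underlying topic, not a single number. In this sketch, rows from different topics simply carry different fields (e.g. a sentiment value or a price), so the resulting table would be sparse rather than joined.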
r
Cool, thanks - I'll try that 😀
k
Happy to help if you can share the PR or a GitHub repo.
r
We're not yet there - I first have to drag parts of our team into the Kafka rabbit hole to get the tweets and reddit posts etc. on Kafka, then we have to bring it together with the commodity price data...
But it would really be super helpful if you could somehow read multiple topics into one Pinot table without having to add another moving part (=Flink, ksqlDB, Decodable, Materialize...) to the platform stack...
Let alone having some maybe very restricted join functionality in Pinot - like Rockset does (even though I don't think they can keep the same latency + concurrency guarantees that you can)...