# pinot-dev

suraj kamath

07/22/2021, 2:03 PM
Hi, I'm looking at the use case of reading the Pinot segments of a realtime table in an Apache Spark / Apache Flink job. The objective of the Spark job is to:
1. Get/download the segments of a realtime table
2. Read the segments and convert them to JSON
I was wondering if there's a plugin or library (jar) of sorts that I can use to achieve this? Any pointers would be helpful. Thanks
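
(A minimal sketch of step 2, assuming Pinot's `PinotSegmentRecordReader` — its package has moved between releases, so the import below may need adjusting for your Pinot version. The class name `SegmentToJson` and the segment path are placeholders, and the segment is assumed to be already downloaded and untarred:)

```java
import java.io.File;
import org.apache.pinot.segment.local.segment.readers.PinotSegmentRecordReader;
import org.apache.pinot.spi.data.readers.GenericRow;

// Sketch: iterate the rows of one untarred Pinot segment directory and
// print each row's field/value map (swap in a JSON library to serialize).
public class SegmentToJson {
  public static void main(String[] args) throws Exception {
    File segmentDir = new File(args[0]); // placeholder: path to an untarred segment
    try (PinotSegmentRecordReader reader = new PinotSegmentRecordReader(segmentDir)) {
      GenericRow row = new GenericRow();
      while (reader.hasNext()) {
        row.clear();
        reader.next(row);
        System.out.println(row.getFieldToValueMap());
      }
    }
  }
}
```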

Mayank

07/22/2021, 2:08 PM
Hi @User, curious to learn what's the motivation behind this? Wouldn't it be easier to simply ETL the realtime events onto the data lake (without having to go through Pinot, just for the ETL purpose)?
As in, the same events that are getting ingested into Pinot can also go directly to your data lake?
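
(For illustration, a hedged sketch of that direct path with Spark Structured Streaming, assuming the events arrive on a Kafka topic; the broker address, topic name, and lake paths are all placeholders:)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: mirror the same Kafka events that feed Pinot into a data lake
// as Parquet, independent of Pinot. All endpoints/paths are placeholders.
public class EventsToLake {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("events-to-lake").getOrCreate();

    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load();

    events.selectExpr("CAST(value AS STRING) AS json")
        .writeStream()
        .format("parquet")
        .option("path", "s3a://lake/events/")
        .option("checkpointLocation", "s3a://lake/_checkpoints/events/")
        .start()
        .awaitTermination();
  }
}
```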

suraj kamath

07/22/2021, 2:18 PM
Hi @User, we are exploring the possibility of moving the segments from a realtime table to an offline table. While Minions are the advised way of achieving this, I also wanted to try whether this is achievable through Apache Spark.
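
(For reference, the Minion route is driven from the realtime table config via the `RealtimeToOfflineSegmentsTask`; a hedged snippet, with the period values purely illustrative:)

```json
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "6h",
      "bufferTimePeriod": "12h"
    }
  }
}
```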

Mayank

07/22/2021, 2:30 PM
Hi @User, you could still use Apache Spark on raw data in your data lake and push to the offline table. The data does not need to come from downloading Pinot segments and converting them to JSON; it can still be directly ETL'd from your realtime stream system onto the data lake.
Does that make sense?
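
(A hedged sketch of what that Spark push could look like, using Pinot's Spark batch-ingestion plugin and a job spec; all URIs and the table name are placeholders, and exact runner class names can vary by Pinot release:)

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  extraConfigs:
    stagingDir: 's3a://lake/staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3a://lake/events/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3a://lake/segments/'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://controller:9000'
```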

suraj kamath

07/22/2021, 4:26 PM
Sure, @User. What I am actually looking for is a way to move the segments from a realtime table to an offline table without the use of Minion.
While we will most probably go with Minion for this, I also wanted to explore the feasibility of using Spark to achieve this.

Mayank

07/22/2021, 4:28 PM
Just so I understand: are you trying to test that, given raw data, you can push it using Spark? Or are you trying to test explicitly moving realtime segments to offline manually?

suraj kamath

07/22/2021, 4:29 PM
moving realtime segments to offline manually
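
(For what it's worth, the manual route would start from the controller REST API; a hedged sketch with placeholder host, table, and segment names, and endpoints that may differ across Pinot versions:)

```bash
# List the segments of a realtime table (placeholder controller/table names)
curl "http://controller:9000/segments/myTable?type=REALTIME"

# Download one completed segment as a tarball (placeholder segment name)
curl -o myTable__0__0__20210722T0000Z.tar.gz \
  "http://controller:9000/segments/myTable/myTable__0__0__20210722T0000Z"
```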

Mayank

07/22/2021, 4:29 PM
Ok, that's where I am unable to see the motivation.
I can totally see the need for testing: given raw data in the data lake, how to push to an offline table.
This is a more popular pattern than downloading from realtime, converting to raw JSON, regenerating the segment, and pushing to offline.