# pinot-dev

suraj kamath

07/22/2021, 2:03 PM
Hi, I'm looking at the use case of reading the Pinot segments of a realtime table in an Apache Spark / Apache Flink job. The objective of the Spark job is to:
1. Get/download the segments of a realtime table
2. Read the segments and convert them to JSON
I was wondering if there's a plugin or library (jar) of sorts that I can use to achieve this? Any pointers would be helpful. Thanks
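
(A minimal sketch of step 2, assuming Pinot's `PinotSegmentRecordReader` — its package has moved between releases, so the import below may need adjusting for your Pinot version. The class name `SegmentToJson` and the segment path are placeholders, and the segment is assumed to be already downloaded and untarred:)

```java
import java.io.File;
import org.apache.pinot.segment.local.segment.readers.PinotSegmentRecordReader;
import org.apache.pinot.spi.data.readers.GenericRow;

// Sketch: iterate the rows of one untarred Pinot segment directory and
// print each row's field/value map (swap in a JSON library to serialize).
public class SegmentToJson {
  public static void main(String[] args) throws Exception {
    File segmentDir = new File(args[0]); // placeholder: path to an untarred segment
    try (PinotSegmentRecordReader reader = new PinotSegmentRecordReader(segmentDir)) {
      GenericRow row = new GenericRow();
      while (reader.hasNext()) {
        row.clear();
        reader.next(row);
        System.out.println(row.getFieldToValueMap());
      }
    }
  }
}
```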

Mayank

07/22/2021, 2:08 PM
Hi @User, curious to learn what's the motivation behind this? Wouldn't it be easier to simply ETL the realtime events onto the data lake (without having to go through Pinot, just for the ETL purpose)?
As in, the same events that are getting ingested into Pinot can also go directly to your data lake?
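
(For illustration, a hedged sketch of that direct path with Spark Structured Streaming, assuming the events arrive on a Kafka topic; the broker address, topic name, and lake paths are all placeholders:)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: mirror the same Kafka events that feed Pinot into a data lake
// as Parquet, independent of Pinot. All endpoints/paths are placeholders.
public class EventsToLake {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("events-to-lake").getOrCreate();

    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load();

    events.selectExpr("CAST(value AS STRING) AS json")
        .writeStream()
        .format("parquet")
        .option("path", "s3a://lake/events/")
        .option("checkpointLocation", "s3a://lake/_checkpoints/events/")
        .start()
        .awaitTermination();
  }
}
```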

suraj kamath

07/22/2021, 2:18 PM
Hi @User, we are exploring the possibility of moving the segments from a realtime table to an offline table. While Minions are the advised way of achieving this, I also wanted to try whether this is achievable through Apache Spark.
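
(For reference, the Minion route is driven from the realtime table config via the `RealtimeToOfflineSegmentsTask`; a hedged snippet, with the period values purely illustrative:)

```json
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "6h",
      "bufferTimePeriod": "12h"
    }
  }
}
```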

Mayank

07/22/2021, 2:30 PM
Hi @User, you could still use Apache Spark on raw data in your data lake and push to the offline table. The data does not need to come from downloading Pinot segments and converting them to JSON; it can still be directly ETL'd from your realtime stream system onto the data lake.
Does that make sense?
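
(A hedged sketch of what that Spark push could look like, using Pinot's Spark batch-ingestion plugin and a job spec; all URIs and the table name are placeholders, and exact runner class names can vary by Pinot release:)

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  extraConfigs:
    stagingDir: 's3a://lake/staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3a://lake/events/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3a://lake/segments/'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://controller:9000'
```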

suraj kamath

07/22/2021, 4:26 PM
Sure, @User. What I am actually looking for is a way to move the segments from a realtime table to an offline table without the use of Minion.
While we will most probably go with Minion for this, I also wanted to explore the feasibility of using Spark to achieve this.

Mayank

07/22/2021, 4:28 PM
Just so I understand: are you trying to test that, given raw data, you can push it using Spark? Or are you trying to test explicitly moving realtime segments to offline manually?

suraj kamath

07/22/2021, 4:29 PM
moving realtime segments to offline manually
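
(For what it's worth, the manual route would start from the controller REST API; a hedged sketch with placeholder host, table, and segment names, and endpoints that may differ across Pinot versions:)

```bash
# List the segments of a realtime table (placeholder controller/table names)
curl "http://controller:9000/segments/myTable?type=REALTIME"

# Download one completed segment as a tarball (placeholder segment name)
curl -o myTable__0__0__20210722T0000Z.tar.gz \
  "http://controller:9000/segments/myTable/myTable__0__0__20210722T0000Z"
```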

Mayank

07/22/2021, 4:29 PM
Ok, that's where I am unable to see the motivation.
I can totally see the need for testing: given raw data in the data lake, how to push to an offline table.
This is a more popular pattern than downloading from realtime, converting to raw JSON, regenerating the segment, and pushing to offline.