# general
a
A couple of starter questions. Let's assume we have a table in HDFS that gets loaded every 30 minutes with the following structure, e.g. /tmp/event/dt=2021-01-01/batch_id=2021-01-01_01_00_00
1. How do we incrementally load the data into Pinot, atomically?
2. Let's assume we have to fix historical data. How do we reload an older batch that is already loaded into Pinot, e.g. /tmp/event/dt=2020-01-01/batch_id=2020-01-01_01_00_00?
3. Is there a way to build a Pinot segment directly from a Spark DataFrame? Is there a specific interface I can implement in our existing Spark app?
m
1. You can have a scheduled batch job that incrementally pushes data to Pinot as it arrives in HDFS. Curious though: if data lands every 30 min, do you have a stream pipeline that Pinot can ingest from directly?
2. Historical segments can be overwritten in Pinot. Any segment pushed to Pinot with the same name as an existing segment will overwrite the existing one; you just need to ensure the two cover the same time period.
3. I haven't looked at Spark DataFrames, but for segment generation from any format you just need to implement the RecordReader interface (see the sketch below). @User do we have this in OSS?
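For point 3, here is a minimal sketch of a custom RecordReader, assuming a made-up tab-separated input format. `TsvRecordReader` is a hypothetical name, and the `org.apache.pinot.spi.data.readers` package paths are from recent Pinot releases and may differ in older ones:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

// Hypothetical reader for a TSV file whose first line is a header row.
public class TsvRecordReader implements RecordReader {
  private File _dataFile;
  private BufferedReader _reader;
  private String[] _columns;  // column names from the header line
  private String _nextLine;   // look-ahead buffer backing hasNext()

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig config)
      throws IOException {
    _dataFile = dataFile;
    _reader = new BufferedReader(new FileReader(dataFile));
    String header = _reader.readLine();
    _columns = header == null ? new String[0] : header.split("\t");
    _nextLine = _reader.readLine();
  }

  @Override
  public boolean hasNext() {
    return _nextLine != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    reuse.clear();
    String[] values = _nextLine.split("\t");
    for (int i = 0; i < Math.min(_columns.length, values.length); i++) {
      reuse.putValue(_columns[i], values[i]);
    }
    _nextLine = _reader.readLine();
    return reuse;
  }

  @Override
  public void rewind() throws IOException {
    // Segment generation reads the input more than once (a stats pass,
    // then an index-building pass), so the reader must be restartable.
    _reader.close();
    init(_dataFile, null, null);
  }

  @Override
  public void close() throws IOException {
    _reader.close();
  }
}
```

The pattern is the same for any source format: init opens the input, next fills a GenericRow keyed by column name, and rewind restarts the pass.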
j
Yes, we’ve already had some spark job which is ready in our OSS. @User you can check
PinotSparkJobLauncher
or
IntermediateSegmentTest
to see how it’s going to be used in your spark App
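I can't vouch for the exact APIs of those two classes, but the general recipe in that style of code is to feed in-memory rows into Pinot's segment creation driver. A rough sketch of building one segment from one DataFrame partition; `DataFrameSegmentBuilder` and its parameters are hypothetical, and the package paths (from recent Pinot releases) may differ in your version:

```java
import java.util.List;

import org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl;
import org.apache.pinot.segment.local.segment.readers.GenericRowRecordReader;
import org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig;
import org.apache.pinot.spi.config.table.TableConfig;
import org.apache.pinot.spi.data.Schema;
import org.apache.pinot.spi.data.readers.GenericRow;

// Hypothetical helper: builds one Pinot segment from rows materialized
// out of a single Spark DataFrame partition.
public class DataFrameSegmentBuilder {
  public static void buildSegment(List<GenericRow> rows, TableConfig tableConfig,
      Schema schema, String outDir, String segmentName) throws Exception {
    SegmentGeneratorConfig config = new SegmentGeneratorConfig(tableConfig, schema);
    config.setOutDir(outDir);           // local directory for the built segment
    config.setSegmentName(segmentName);

    // GenericRowRecordReader wraps the in-memory rows as a RecordReader,
    // so the driver can consume them like any other input format.
    SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
    driver.init(config, new GenericRowRecordReader(rows));
    driver.build();
  }
}
```

On the Spark side you would typically map each DataFrame `Row` to a `GenericRow` inside `foreachPartition`, call something like the helper above, then tar the output directory and push it to the controller.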
thank you!