
Dan Hill

08/05/2020, 5:27 AM
I'm about to start writing a daily MapReduce job to prepare segments for offline ingestion. I'm using Flink for the streaming ingestion. Any design tips for using Flink?
• Should I have Flink write the files to S3 and then run LaunchDataIngestionJob using a workflow tool? (Sketch below.)
• What's the status of the batch plugins? Do they make it easy to encapsulate the client-side parts of LaunchDataIngestionJob? https://docs.pinot.apache.org/plugins/pinot-batch-ingestion
I'm also fine with writing this in Spark if it makes it a lot easier, but I'd prefer Flink to keep the implementation consistent.
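For the first bullet, one common shape is exactly that: the streaming/batch job lands files in S3, and a workflow step then triggers Pinot's LaunchDataIngestionJob CLI. A minimal Python sketch of that workflow step, assuming pinot-admin.sh is available on the worker and the job spec YAML (the path here is a hypothetical placeholder) already points at the S3 input:

```python
# Sketch: trigger Pinot batch ingestion once input files are in S3.
# The spec path and pinot-admin.sh location are hypothetical placeholders.
import subprocess

def launch_ingestion_job(job_spec_path: str) -> None:
    """Run Pinot's LaunchDataIngestionJob for a prepared job spec YAML."""
    subprocess.run(
        ["bin/pinot-admin.sh", "LaunchDataIngestionJob",
         "-jobSpecFile", job_spec_path],
        check=True,  # fail this workflow step if ingestion fails
    )

if __name__ == "__main__":
    launch_ingestion_job("specs/daily-offline-ingestion.yaml")
```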

Kishore G

08/05/2020, 5:29 AM
That's one option. We should have a job that automatically moves data from real-time to offline tables this quarter.
Spark is probably better

Dan Hill

08/05/2020, 5:31 AM
Sweet!
Are there any example Spark projects that write to Pinot?
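As a sketch of what the Spark side can look like: instead of writing to Pinot directly, a batch job typically writes input files (e.g. Avro or Parquet, both of which Pinot's batch ingestion can read) to a location that the ingestion job spec points at. A minimal PySpark example; the bucket, paths, and column names are hypothetical:

```python
# Sketch: a Spark batch job that prepares daily input files for Pinot's
# offline ingestion. Bucket, paths, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-pinot-prep").getOrCreate()

# Read the raw events for the day.
events = spark.read.parquet("s3a://my-bucket/raw/events/2020-08-05/")

# Aggregate into the shape of the offline table.
daily = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

# LaunchDataIngestionJob can then read this directory as its input URI.
daily.write.mode("overwrite").parquet("s3a://my-bucket/pinot-input/2020-08-05/")
```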

Dan Hill

08/05/2020, 5:11 PM
I'm not as familiar with Spark. It looks like the examples are separate Spark jobs. Does Spark come with a workflow tool that makes it easy to run the ingestion job after doing other batch processing, or do people usually use a tool like Apache Airflow, Oozie, etc. to do this?
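Spark itself doesn't ship a workflow engine, so an external orchestrator is the usual answer. A minimal Airflow sketch that chains the Spark prep job and the Pinot ingestion command; the script paths, spec file, and schedule are hypothetical, and the import path assumes Airflow 2.x:

```python
# Sketch: an Airflow DAG that runs the Spark prep job, then triggers Pinot's
# batch ingestion. Paths, spec file, and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_pinot_offline_ingestion",
    start_date=datetime(2020, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = BashOperator(
        task_id="spark_prepare_input",
        bash_command="spark-submit --master yarn jobs/daily_pinot_prep.py",
    )
    ingest = BashOperator(
        task_id="pinot_launch_ingestion",
        bash_command=(
            "bin/pinot-admin.sh LaunchDataIngestionJob "
            "-jobSpecFile specs/daily-offline-ingestion.yaml"
        ),
    )
    # Run ingestion only after the Spark job succeeds.
    prepare >> ingest
```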