# general
p
What is the recommended approach for batch ingestion of data from, let's say, either S3 or Hive into Pinot: minion-based ingestion vs. ingestion jobs? Are there any pros/cons between the two?
k
When you say “ingestion jobs”, are you talking about Spark/Hadoop, or using the stand-alone tool?
p
Good question - I will update my question to say minion-based ingestion vs. standalone ingestion vs. Spark. If it helps, I am thinking of using Airflow to trigger a daily push.
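For context, a minimal Airflow sketch of that "daily push" idea could look like the following. It assumes the Pinot standalone ingestion tool is available on the Airflow worker and that a batch ingestion job spec (hypothetically at /opt/pinot/jobSpec.yaml) already points at the S3/Hive input and the target offline table; the DAG id and paths are placeholders, not anything from this thread.

```python
# Minimal sketch, not a definitive setup: an Airflow DAG that invokes the
# Pinot standalone batch ingestion tool once a day. Paths, the DAG id, and
# the job-spec location are assumptions/placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pinot_daily_batch_ingest",   # hypothetical DAG id
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # LaunchDataIngestionJob is the standalone ingestion entry point; the
    # job spec YAML (input location, segment generation, push to controller)
    # is assumed to already exist at the path below.
    ingest = BashOperator(
        task_id="launch_pinot_ingestion_job",
        bash_command=(
            "/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob "
            "-jobSpecFile /opt/pinot/jobSpec.yaml"
        ),
    )
```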
l
Minion-based ingestion - did you mean realtime-to-offline conversion? If yes, this is called the managed offline flow. It is relevant when you just want to compact and convert REALTIME segments into OFFLINE segments. Please read this for more details: https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows
Minion-based (managed offline flow) vs. Hadoop/Spark:
• With the managed offline flow, you don't need a separate Hadoop/Spark cluster to prepare the Pinot segments for the same data, so there is less operational overhead on the overall system.
• The managed offline flow is designed to be single-threaded per table, which may not scale for large workloads.
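For readers unfamiliar with the managed offline flow, it is enabled per realtime table through a RealtimeToOfflineSegmentsTask task config; minions and the controller's task scheduler then move data from the REALTIME to the OFFLINE table. The sketch below, which patches a table config through the Pinot controller REST API using the requests library, is only illustrative: the controller URL, table name, and period values are assumptions, and the authoritative list of config keys is in the docs linked above.

```python
# Illustrative sketch only: enable the managed offline flow
# (RealtimeToOfflineSegmentsTask) on an existing REALTIME table by updating
# its table config via the Pinot controller REST API. Controller address,
# table name, and the period values are assumptions.
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "myTable"                      # hypothetical table name

# Fetch the current table config; for a realtime table the response nests
# the config under the "REALTIME" key.
resp = requests.get(f"{CONTROLLER}/tables/{TABLE}")
resp.raise_for_status()
table_config = resp.json()["REALTIME"]

# Add the RealtimeToOfflineSegmentsTask; the values are examples, not defaults.
table_config.setdefault("task", {}).setdefault("taskTypeConfigsMap", {})[
    "RealtimeToOfflineSegmentsTask"
] = {
    "bucketTimePeriod": "1d",   # size of each offline time bucket
    "bufferTimePeriod": "2d",   # how far behind "now" the task stays
}

# Push the updated config back; minions and the controller task scheduler
# must also be running for the task to actually execute.
update = requests.put(f"{CONTROLLER}/tables/{TABLE}", json=table_config)
update.raise_for_status()
print(update.json())
```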
p
Hi Laxman, thanks for your response. I am focusing on ingestion of data for offline table(s) specifically.