# general
What is the recommended approach for batch ingestion of data into Pinot from, say, S3 or Hive: minion-based ingestion or ingestion jobs? Are there any pros/cons between the two?
When you say “ingestion jobs”, are you talking about Spark/Hadoop, or using the stand-alone tool?
Good question - I'll update my question to compare minion-based ingestion vs. standalone ingestion vs. Spark. If it helps, I am thinking of using Airflow to trigger a daily push.
minion-based ingestion - did you mean Realtime-to-Offline conversion? If yes, this is called the managed offline flow. It is relevant when you just want to compact and convert REALTIME segments into OFFLINE segments. Please read this for more details: https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows
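For reference, a minimal sketch of how the managed offline flow is typically enabled in a REALTIME table config (field names follow the linked docs; the table name and bucket/buffer periods here are illustrative placeholders):

```json
{
  "tableName": "myTable_REALTIME",
  "tableType": "REALTIME",
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "2d"
      }
    }
  }
}
```

With this in place, the minion periodically picks up completed realtime segments older than the buffer period and rolls them into daily offline segments.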
Minion-based (managed offline flow) vs. Hadoop/Spark:
• With the managed offline flow, you don't need a separate Hadoop/Spark cluster to prepare the Pinot segments for the same data, so there is less operational overhead on the overall system.
• The managed offline flow is designed to be single-threaded per table, so it may not scale for large workloads.
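For comparison, the standalone/Spark path is driven by a batch ingestion job spec. A minimal sketch for a standalone job reading Parquet from S3 (shape per the Pinot batch ingestion docs; bucket paths, table name, and controller URI are placeholders, and a Spark job differs mainly in the `executionFrameworkSpec`):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/path/to/input/'
outputDirURI: 's3://my-bucket/path/to/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```

A scheduler such as Airflow can then trigger this spec daily via `pinot-admin.sh LaunchDataIngestionJob`.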
Hi Laxman, thanks for your response. I am focusing on ingestion of data for offline table(s) specifically.