What is the recommended approach for batch ingestion of data from, say, S3 or Hive into Pinot: minion-based ingestion or ingestion jobs? Are there any pros/cons between the two?
Ken Krugler
06/14/2022, 9:18 PM
When you say “ingestion jobs”, are you talking about Spark/Hadoop, or using the stand-alone tool?
Priyank Bagrecha
06/14/2022, 9:25 PM
Good question. I will update my question to say minion-based ingestion vs. standalone ingestion vs. Spark. If it helps, I am thinking of using Airflow to trigger a daily push.
Minion-based (managed offline flow) vs. Hadoop/Spark:
• With managed offline flow, you don't need a separate Hadoop/Spark cluster to prepare the Pinot segments for the same data, so there is less operational overhead on the overall system.
• Managed offline flow is designed to be single-threaded per table, so it may not scale for large workloads.
Priyank Bagrecha
06/15/2022, 5:18 PM
Hi Laxman, thanks for your response. I am focusing on ingestion of data for offline table(s) specifically.
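
For the Airflow-triggered daily push mentioned earlier in the thread, here is a rough sketch of how a scheduled task could assemble the standalone ingestion command. It assumes the standard `pinot-admin.sh LaunchDataIngestionJob -jobSpecFile` invocation. The install path, the spec directory, the table name, and the one-spec-per-day naming convention are all hypothetical, and nothing here is Airflow-specific or confirmed by anyone in this thread.

```python
from datetime import date
from pathlib import PurePosixPath


def daily_spec_path(spec_dir: str, table: str, day: date) -> str:
    """Hypothetical convention: one rendered job spec file per table per day."""
    return str(PurePosixPath(spec_dir) / f"{table}_{day.isoformat()}.yaml")


def build_ingestion_command(pinot_home: str, job_spec_file: str) -> list:
    """Build the argv for a standalone Pinot batch ingestion job.

    Only assembles the command; running it (for example with subprocess.run
    or from a scheduler task) is left to the caller.
    """
    admin_script = PurePosixPath(pinot_home) / "bin" / "pinot-admin.sh"
    return [str(admin_script), "LaunchDataIngestionJob", "-jobSpecFile", job_spec_file]


# Example with made-up paths and table name.
spec = daily_spec_path("/etc/pinot/specs", "events", date(2022, 6, 15))
command = build_ingestion_command("/opt/pinot", spec)
print(" ".join(command))
```

A scheduler task could run the resulting command once per day. Whether the job spec uses the standalone or the Spark execution framework is decided inside the spec file, not in this command.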