# general
a
Hi team, any thoughts, suggestions, or preferences on using the pinot-minion framework vs spark+airflow for offline batch ingestion into Pinot?
m
What’s your data size? And what kind of transformations do you do prior to pushing data to Pinot?
a
These are daily/hourly partitioned tables, quite big (GBs to PBs) and stored in Parquet format, but the transformations required before Pinot are not very heavy (just dropping certain columns before ingesting).
m
Theoretically, both minion and spark are valid options. If you have PBs of data to process per job, maybe start with spark, though.
a
Okay. We also want to schedule this job periodically to ingest data, and the number of such schedules could be high (hourly, and assume 100s of different batch pipelines), so I thought of standard spark+airflow to do the job. My feeling is that in the minion framework we use the pinot-controller as a scheduler, which might impact overall cluster throughput if the number of jobs is high. Am I right to assume that?
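
To put a rough number on "hourly, 100s of pipelines", here is a minimal sketch that enumerates one day of hourly runs. The pipeline names and the count of 100 are hypothetical stand-ins, and the helper `hourly_runs` is made up for illustration; it is not part of Pinot, Spark, or Airflow.

```python
from datetime import datetime, timedelta


def hourly_runs(pipelines, day):
    """Return (pipeline, hour_start) pairs for one day of hourly schedules."""
    start = datetime(day.year, day.month, day.day)
    return [
        (name, start + timedelta(hours=hour))
        for name in pipelines
        for hour in range(24)
    ]


# Hypothetical names; 100 stands in for "100s of different batch pipelines".
pipelines = [f"pipeline_{i:03d}" for i in range(100)]
runs = hourly_runs(pipelines, datetime(2024, 1, 1))

# 100 pipelines x 24 hourly slots = 2400 scheduled runs per day.
print(len(runs))
```

So whichever scheduler is used (the controller or airflow) would have to trigger on the order of a few thousand runs per day under these assumptions.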