# general
g
Hi team, I would like to get some suggestions about what the Pinot batch ingestion story looks like in a production environment. Ideally we want to use Spark cluster mode for ingestion in production, but we ran into lots of issues when submitting jobs in a distributed fashion to our production Spark clusters on YARN. Currently we only have Spark local mode and Pinot standalone ingestion working for batch data, and we are worried this will not be sustainable for ingesting larger production tables. What do people generally use for ingesting Pinot data in production? Asking because I don’t see much documentation or discussion around running the Spark segment generation job with a YARN master and cluster deploy mode. We are on Hadoop 2.9.1, Spark 2.4.6 on YARN, and Pinot 0.9.2, so we are also interested to know whether anyone has successfully set up cluster-mode batch ingestion in a similar Hadoop/Spark environment 👀.
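For context, the cluster-mode submission we have been attempting looks roughly like the sketch below. This is not our exact command: the install path, classpath/plugin settings, and job spec file name are placeholders, and this submission step is where we keep running into issues.
```
# rough sketch of a cluster-mode submission; paths and the job spec file are placeholders
export PINOT_VERSION=0.9.2
export PINOT_DISTRIBUTION_DIR=/opt/apache-pinot-${PINOT_VERSION}-bin

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile sparkIngestionJobSpec.yaml
```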
k
Hi @User - we run Flink jobs to build per-segment CSV files (saved in HDFS), then run the Pinot Hadoop MapReduce segment generation job to build segments. Both of these workflows are running on Yarn. The resulting segments are stored in HDFS, in our configured “deep store” location. Then we run an admin job (single server) to do a metadata push of the new segments. This has been in production for a few months. We use Airflow to orchestrate everything. Just FYI, now that support has been added for a Flink segment writer “sink”, we’ll be able to skip the intermediate CSV file generation step.
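For reference, the Hadoop segment generation step is driven by an ingestion job spec roughly along these lines. This is just a sketch, not our exact spec: the table name, HDFS paths, and controller/schema URIs are placeholders, and the runner class and extra configs can vary with your Pinot version.
```
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs://namenode/pinot/staging/'          # placeholder staging dir for the MR job
jobType: SegmentCreation
inputDirURI: 'hdfs://namenode/flink-output/myTable/'      # per-segment CSV files written by the Flink jobs
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs://namenode/pinot/deepstore/myTable/'  # configured deep store location
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: 'org.apache.pinot.plugin.filesystem.HadoopPinotFS'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://pinot-controller:9000/tables/myTable/schema'
  tableConfigURI: 'http://pinot-controller:9000/tables/myTable'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```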
👍 1
g
@User Thanks a lot for the info! Would you mind sharing which Flink and Hadoop versions you are on?
k
Flink 1.13.2 (transitioning to 1.14.3) and Hadoop 2.8.5
👍 1
thankyou 1
s
Thanks @User for the details. When you say a metadata push of the new segments, is there a way to get only the metadata for each segment and make a call to Pinot? Thanks for your attention.
k
A “metadata push” is a kind of data ingestion job that you submit to the Pinot admin tool. See https://docs.pinot.apache.org/basics/data-import/batch-ingestion#ingestion-jobs for details. You specify a directory containing segments, plus information about the controller, the table name, etc., via the ingestion job spec (a YAML file).
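As a rough (hypothetical) example, a metadata push job spec for segments already sitting in deep store could look something like this; the table name, HDFS path, and controller URI are placeholders:
```
executionFrameworkSpec:
  name: 'standalone'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentMetadataPush
inputDirURI: 'hdfs://namenode/pinot/deepstore/myTable/'   # directory containing the already-built segments
pinotFSSpecs:
  - scheme: hdfs
    className: 'org.apache.pinot.plugin.filesystem.HadoopPinotFS'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
We kick it off with the admin tool, e.g. something like `bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile metadata-push.yaml`, on the single admin server I mentioned.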
s
This is helpful, Ken. Thanks for the reference @User! 🙂