Hi All, I'm seeing some unexpected behaviour with batch ingestion.
I'm running Spark on EMR, with S3 as the input/output path. This is my run command:
spark-submit -v \
--class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
--master yarn \
--deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
--conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar" \
--conf "spark.executor.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar" \
--jars "${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3/pinot-batch-ingestion-spark-3-${PINOT_VERSION}-shaded.jar,${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar" \
--files s3://our-s3-path/raw-pinot/spec_file/ingestion_job_spec.yaml \
local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
-jobSpecFile ingestion_job_spec.yaml
Currently the job succeeds and generates 1 segment in approximately 20 seconds.
However, the segmentCreationJobParallelism setting in the ingestion job spec YAML file seems to have no effect. Are there any additional steps I can take to increase the parallelism?
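For reference, the ingestion job spec is roughly along these lines (bucket paths, table name, controller URI, and the parallelism value below are placeholders rather than the exact values from my spec):

executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
# placeholder input/output locations
inputDirURI: 's3://our-s3-path/raw-pinot/input/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://our-s3-path/raw-pinot/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://controller-host:9000'
# the setting that does not appear to have any effect
segmentCreationJobParallelism: 8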