I am running the Spark command below in cluster mode. The last step, which copies files from the staging directory to the output directory, is taking too long because it copies one file at a time. For 8000 files, that final step alone takes more than 10 hours. Any suggestions on how to improve the performance?
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=1000s \
  --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true \
  --conf parquet.enable.summary-metadata=false \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --conf spark.sql.hive.convertMetastoreParquet.mergeSchema=false \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
  --conf "spark.driver.extraClassPath=pinot-batch-ingestion-spark-2.4-${PINOT_VERSION}-SNAPSHOT-shaded.jar:pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar:pinot-s3-${PINOT_VERSION}-SNAPSHOT-shaded.jar:pinot-parquet-${PINOT_VERSION}-SNAPSHOT-shaded.jar" \
  --conf "spark.executor.extraClassPath=pinot-batch-ingestion-spark-2.4-${PINOT_VERSION}-SNAPSHOT-shaded.jar:pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar:pinot-s3-${PINOT_VERSION}-SNAPSHOT-shaded.jar:pinot-parquet-${PINOT_VERSION}-SNAPSHOT-shaded.jar" \
  --jars "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-2.4/pinot-batch-ingestion-spark-2.4-${PINOT_VERSION}-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-SNAPSHOT-shaded.jar" \
  --files s3://roku-dea-dev/sand-box/suraj/spark_job_spec_offlinebookingnarrow_perf.yaml \
  local://pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile spark_job_spec_offlinebookingnarrow_perf.yaml