Scott deRegt
05/27/2022, 3:57 PM
spark Batch Ingestion job when moving from --master local --deploy-mode client to --master yarn --deploy-mode cluster (as suggested here for production environments). I would greatly appreciate some guidance from others who have successfully configured this spark job. Details in thread 🧵
Scott deRegt
05/27/2022, 3:59 PM
spark-submit locally using the following command:
sudo spark-submit --verbose \
--class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
--master local --deploy-mode client \
--conf spark.local.dir=/mnt \
--conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/conf/pinot-ingestion-job-log4j2.xml" \
--conf "spark.driver.extraClassPath=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
-jobSpecFile /mnt/pinot/daily_channel_user_metrics_20220502.yaml
Scott deRegt
05/27/2022, 4:00 PM
--master yarn, whether using --deploy-mode client or --deploy-mode cluster.
Scott deRegt
05/27/2022, 4:01 PM
yarn being misconfigured on my EMR cluster, I successfully ran this example:
spark-submit --master yarn --deploy-mode cluster --class "org.apache.spark.examples.JavaSparkPi" /usr/lib/spark/examples/jars/spark-examples.jar
Scott deRegt
05/27/2022, 4:05 PM
sudo spark-submit --verbose \
--class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
--master yarn --deploy-mode cluster \
--conf spark.local.dir=/mnt \
--conf "spark.driver.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/conf/pinot-ingestion-job-log4j2.xml" \
--conf "spark.driver.extraClassPath=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
--conf "spark.executor.extraJavaOptions=-Dplugins.dir=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins -Dplugins.include=pinot-s3,pinot-parquet -Dlog4j2.configurationFile=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/conf/pinot-ingestion-job-log4j2.xml" \
--conf "spark.executor.extraClassPath=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
--files /mnt/pinot/daily_channel_user_metrics_20220502.yaml \
/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
-jobSpecFile daily_channel_user_metrics_20220502.yaml
which gets stuck when trying to add executor tasks, complaining that ApplicationMaster has not yet registered:
2022/05/27 16:03:34.967 INFO [DAGScheduler] [dag-scheduler-event-loop] Submitting 1000 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at SparkSegmentGenerationJobRunner.java:237) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2022/05/27 16:03:34.968 INFO [YarnClusterScheduler] [dag-scheduler-event-loop] Adding task set 0.0 with 1000 tasks
2022/05/27 16:03:39.866 WARN [YarnSchedulerBackend$YarnSchedulerEndpoint] [dispatcher-event-loop-2] Attempted to request executors before the AM has registered!
2022/05/27 16:03:39.867 WARN [ExecutorAllocationManager] [spark-dynamic-executor-allocation] Unable to reach the cluster manager to request 11 total executors!
It eventually times out and fails.
Scott deRegt
05/27/2022, 4:16 PM
java.lang.ClassNotFoundException of pinot libs.
Unclear to me how the driver seems to be able to execute a portion of the main class (it lists S3 files and tries to start tasks), yet the ApplicationMaster seems to fail to boot and register properly.
Xiang Fu
Xiang Fu
Kartik Khare
05/27/2022, 7:29 PM
spark.driver.extraClassPath with --jars argument as well. That will solve the issue.
Kartik Khare
05/27/2022, 7:30 PM
spark.driver.extraJavaOptions=
Scott deRegt
05/27/2022, 7:34 PM
Scott deRegt
05/27/2022, 7:48 PM
spark.driver.extraJavaOptions, I appear to lose stdout (I think due to the log4j config file).
stderr is logging this with new attempt:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.yarn.api.records.Resource.newInstance(JJII)Lorg/apache/hadoop/yarn/api/records/Resource;
at org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:153)
at org.apache.spark.deploy.yarn.YarnRMClient.createAllocator(YarnRMClient.scala:84)
at org.apache.spark.deploy.yarn.ApplicationMaster.createAllocator(ApplicationMaster.scala:438)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:308)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:248)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:248)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:248)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:783)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:782)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:247)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:807)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
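A NoSuchMethodError like the one above usually means that a different version of the class was loaded at runtime than the one the caller was compiled against. One way to narrow that down is to check which of the jars being shipped bundle their own copy of org.apache.hadoop.yarn.api.records.Resource. The following is a minimal sketch, not a definitive diagnostic: class_to_path and jar_has_class are hypothetical helper names introduced here, the jar paths are the ones from the thread, and it assumes unzip is installed on the node.

```shell
# Sketch: report which jars bundle a given class (helper names are hypothetical).

# class_to_path CLASS: turn a fully qualified class name into its path inside a jar.
class_to_path() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# jar_has_class JAR CLASS: succeeds if the jar's listing contains the class file.
# Assumes unzip is available; a missing jar or missing unzip counts as "not found".
jar_has_class() {
  unzip -l "$1" 2>/dev/null | grep -qF "$(class_to_path "$2")"
}

CLASS=org.apache.hadoop.yarn.api.records.Resource
PINOT_HOME=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin

# Jar paths copied from the spark-submit commands in this thread.
for jar in \
  "$PINOT_HOME/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar" \
  "$PINOT_HOME/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar" \
  "$PINOT_HOME/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar" \
  "$PINOT_HOME/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar"
do
  if [ -f "$jar" ] && jar_has_class "$jar" "$CLASS"; then
    echo "$jar bundles $CLASS"
  fi
done
```

Any jar reported here carries its own copy of the YARN API class, which may shadow the version installed on the cluster.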
Scott deRegt
05/27/2022, 7:51 PM
sudo spark-submit --verbose \
--class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
--master yarn --deploy-mode cluster \
--conf spark.local.dir=/mnt \
--conf "spark.driver.extraClassPath=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
--conf "spark.executor.extraClassPath=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
--files /mnt/pinot/daily_channel_user_metrics_20220502.yaml \
--jars /mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.11.0-SNAPSHOT-shaded.jar,/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar,/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar,/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar \
/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
-jobSpecFile daily_channel_user_metrics_20220502.yaml
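The command above repeats the same four jar paths three times: colon-separated for the two extraClassPath settings and comma-separated for --jars. A minimal sketch of deriving both strings from a single list is shown here. join_by, PINOT_HOME, PINOT_VERSION, JARS_COMMA, and CLASSPATH_COLON are hypothetical names introduced for illustration and are not part of Pinot or Spark; the paths simply mirror the ones in the thread.

```shell
# Sketch: build the --jars and extraClassPath values from one jar list.
# All variable and function names here are hypothetical.
PINOT_HOME=/mnt/pinot/apache-pinot-0.11.0-SNAPSHOT-bin
PINOT_VERSION=0.11.0-SNAPSHOT

# join_by SEP ITEM...: join the items with a one-character separator.
# The body runs in a subshell, so the IFS change does not leak out.
join_by() (
  IFS=$1
  shift
  printf '%s\n' "$*"
)

# The four jars used in the thread, held as positional parameters.
set -- \
  "$PINOT_HOME/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-$PINOT_VERSION-shaded.jar" \
  "$PINOT_HOME/lib/pinot-all-$PINOT_VERSION-jar-with-dependencies.jar" \
  "$PINOT_HOME/plugins/pinot-file-system/pinot-s3/pinot-s3-$PINOT_VERSION-shaded.jar" \
  "$PINOT_HOME/plugins/pinot-input-format/pinot-parquet/pinot-parquet-$PINOT_VERSION-shaded.jar"

JARS_COMMA=$(join_by , "$@")        # value for --jars
CLASSPATH_COLON=$(join_by : "$@")   # value for spark.*.extraClassPath

# Print the arguments so they can be inspected before being passed to spark-submit.
echo "--jars $JARS_COMMA"
echo "--conf spark.driver.extraClassPath=$CLASSPATH_COLON"
echo "--conf spark.executor.extraClassPath=$CLASSPATH_COLON"
```

Keeping one list makes it harder for the three settings to drift apart when a plugin jar is added or removed.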
Kartik Khare
05/27/2022, 7:52 PM
Scott deRegt
05/27/2022, 8:21 PM
HADOOP_CLASSPATH in logs here, so this seems plausible:
YARN executor launch context:
env:
CLASSPATH -> ...
SPARK_YARN_CONTAINER_CORES -> ...
SPARK_DIST_CLASSPATH -> ...
SPARK_YARN_STAGING_DIR -> ...
SPARK_USER -> ...
JAVA_HOME -> ...
SPARK_PUBLIC_DNS -> ...
Scott deRegt
05/27/2022, 8:26 PM
/etc/hadoop/conf/hadoop-env.sh though.
Scott deRegt
05/27/2022, 8:41 PM
hadoop classpath from EMR master node (where I am submitting the application from) is giving me this result:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*::/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
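A classpath this long is easier to inspect with one entry per line. A minimal sketch follows; list_classpath_entries is a hypothetical helper name, and SAMPLE is a shortened excerpt of the output above rather than the full value.

```shell
# Sketch: print one classpath entry per line, dropping empty entries
# such as the one produced by the "::" in the output above.
# list_classpath_entries is a hypothetical helper name.
list_classpath_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -v '^$'
}

# Shortened excerpt of the `hadoop classpath` output from the thread.
SAMPLE='/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-yarn/lib/*::/etc/tez/conf'
list_classpath_entries "$SAMPLE"

# On the EMR master node the live value could be inspected the same way:
#   list_classpath_entries "$(hadoop classpath)"
```

Piping the result through grep yarn narrows the list to the directories from which the cluster's own YARN jars are loaded.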
Scott deRegt
05/29/2022, 4:22 AM
spark version 2.4.8-amzn-0
Hadoop 2.10.1-amzn-2
His guidance on a fix:
so these unexpected hadoop deps might be coming from pinot-parquet plugin (although that uses hadoop 2.10.1 which is what the EMR cluster is on)
Anyways, here's what you can do
• open pom.xml in pinot -> pinot-plugins -> pinot-input-format -> pinot-parquet
• change the dependency scope of hadoop-common and hadoop-mapreduce-client-core from compile to provided
• recompile the code
Kartik Khare
05/29/2022, 4:24 AM