Hi, I am trying to load parquet files using spark ...
# troubleshooting
n
Hi, I am trying to load parquet files using the Spark ingestion task. I built the jars for Java 8 using the command:
```
mvn install package -DskipTests -DskipITs -Pbin-dist -Drat.ignoreErrors=true -Djdk.version=8 -Dspark.version=2.4.5
```
While running the task I am getting this error:
```
21/08/20 15:11:24 ERROR LaunchDataIngestionJobCommand: Exception caught: 
Can't construct a java object for tag:yaml.org,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec; exception=Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
 in 'string', line 1, column 1:
    executionFrameworkSpec:
    ^

	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:349)
	at org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
	at org.yaml.snakeyaml.constructor.BaseConstructor.constructDocument(BaseConstructor.java:141)
	at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
	at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
	at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:427)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.getSegmentGenerationJobSpec(IngestionJobLauncher.java:94)
```
I have loaded all the jars onto the Spark classpath. Any idea how to resolve this?
Ingestion spec:
```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 's3://<bucket>/<staging_path>/'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://<bucket>/<parquet_file_path>/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://<bucket>/<path>/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: '<region>'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: '<pinot_table_name>'
pinotClusterSpecs:
  - controllerURI: '<controller_uri>'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
I have checked the Spark cluster logs and I can see that the jars are present. I also tried running the command
```
Class.forName("org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec")
```
from a notebook, and it ran successfully. Cluster Spark conf:
```
spark.driver.extraClassPath=/home/yarn/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-shaded.jar:/home/yarn/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-shaded.jar:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-shaded.jar:/home/yarn/pinot/pinot-spi-0.8.0-SNAPSHOT.jar

spark.driver.extraJavaOptions=-Dplugins.dir=/home/yarn/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet
```
In the logs I can see the jars being loaded:
```
21/08/20 15:10:25 INFO DriverCorral: Successfully attached library s3a://<jar_path>/pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar to Spark
.......
.......
21/08/20 15:11:21 WARN SparkContext: The jar /local_disk0/tmp/addedFile4953461200729207388pinot_all_0_8_0_SNAPSHOT_jar_with_dependencies-2dd68.jar has been added already. Overwriting of added jars is not supported in the current version.
```
And the plugins are also loaded successfully:
```
21/08/20 15:11:23 INFO PluginManager: Plugins root dir is [/home/yarn/pinot/plugins]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugins: [[pinot-s3, pinot-parquet]]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-parquet] from location [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-parquet] from jar files: [file:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-parquet] from dir [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-s3] from location [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-s3] from jar files: [file:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-s3] from dir [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
```
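A detail worth pausing on here: Class.forName(String) resolves against the defining classloader of the calling class, so the notebook check succeeding does not prove that code running inside the Pinot jar can see the same class. A minimal diagnostic sketch (hypothetical; the class ClassLoaderCheck and the overall setup are not from this thread) that probes both the caller's loader and the thread context loader:
```
// Hypothetical diagnostic, not from the thread: probe the same class name
// through two different classloaders to see where resolution diverges.
public class ClassLoaderCheck {
  public static void main(String[] args) {
    String name = "org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec";
    ClassLoader callerLoader = ClassLoaderCheck.class.getClassLoader();
    ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
    for (ClassLoader loader : new ClassLoader[]{callerLoader, contextLoader}) {
      try {
        // 'false' skips static initialization; only visibility matters here.
        Class.forName(name, false, loader);
        System.out.println(loader + " -> found");
      } catch (ClassNotFoundException e) {
        System.out.println(loader + " -> NOT found");
      }
    }
  }
}
```
If one loader finds the class and the other does not, the jars are on the cluster but not on the classpath that the failing Class.forName call actually consults.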
m
Seems like the YAML file has issues, can you double-check it?
n
According to the stack trace, the cause is a class-not-found error:
```
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
	at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:660)
	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:335)
```
The line of code where it breaks also suggests the same:
```
try {
    cl = getClassForName(name);
} catch (ClassNotFoundException e) {
    throw new YAMLException("Class not found: " + name);
}

protected Class<?> getClassForName(String name) throws ClassNotFoundException {
    return Class.forName(name);
}
```
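For context, the load path in the stack trace (IngestionJobLauncher.getSegmentGenerationJobSpec calling Yaml.loadAs) amounts to something like the following minimal sketch, assuming the snakeyaml 1.x API shown in the trace; the reflective class lookup inside Constructor is exactly where the ClassNotFoundException surfaces:
```
// Minimal sketch of the failing load path (assumes snakeyaml 1.x, per the trace).
// Constructor resolves the target class reflectively via Class.forName, which
// fails if pinot-spi is not visible to the classloader doing the lookup.
import java.io.FileReader;
import org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;

public class LoadSpec {
  public static void main(String[] args) throws Exception {
    Yaml yaml = new Yaml(new Constructor(SegmentGenerationJobSpec.class));
    SegmentGenerationJobSpec spec =
        yaml.loadAs(new FileReader("/tmp/spec_file.yaml"), SegmentGenerationJobSpec.class);
    System.out.println(spec.getJobType());
  }
}
```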
This doesn't look like a YAML file issue. I have cross-verified the YAML file and it looks fine; I have attached it above for cross-verification.
Complete stack trace:
```
21/08/20 15:11:24 ERROR LaunchDataIngestionJobCommand: Got exception to generate IngestionJobSpec for data ingestion job - 
Can't construct a java object for tag:yaml.org,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec; exception=Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
 in 'string', line 1, column 1:
    executionFrameworkSpec:
    ^

	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:349)
	at org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
	at org.yaml.snakeyaml.constructor.BaseConstructor.constructDocument(BaseConstructor.java:141)
	at org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
	at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450)
	at org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:427)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.getSegmentGenerationJobSpec(IngestionJobLauncher.java:94)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:125)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:74)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command--1:1)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw$$iw.<init>(command--1:44)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw.<init>(command--1:46)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw.<init>(command--1:48)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw.<init>(command--1:50)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw.<init>(command--1:52)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read.<init>(command--1:54)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$.<init>(command--1:58)
	at lined4163e7e07a34bf996c1eea078dbedab25.$read$.<clinit>(command--1)
	at lined4163e7e07a34bf996c1eea078dbedab25.$eval$.$print$lzycompute(<notebook>:7)
	at lined4163e7e07a34bf996c1eea078dbedab25.$eval$.$print(<notebook>:6)
	at lined4163e7e07a34bf996c1eea078dbedab25.$eval.$print(<notebook>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
	at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
	at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
	at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:202)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:714)
	at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:667)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:202)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:396)
	at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:373)
	at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
	at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:275)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:373)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
	at scala.util.Try$.apply(Try.scala:192)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
	at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
	at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:660)
	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:335)
	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
	... 59 more
```
n
Are there extra spaces before "executionFrameworkSpec"? It looks like it from the message you pasted. YAML is very particular about spaces.
n
No extra spaces before executionFrameworkSpec. I have attached the YAML file in the message above for cross-reference.
b
Try using --jars instead of extraClassPath.
m
@Kulbir Nijjer I recall you recently faced this issue where Spark wasn't seeing the class. Can we add an FAQ around it?
k
@Nisheet please try the --jars option first, but if that doesn't help, then add a dependencyJarDir entry under extraConfigs in your ingestion spec YAML. You will first need to copy all the jars present in the <PINOT_HOME>/plugins folder to S3:
```
extraConfigs:
  # stagingDir is used in a distributed filesystem to host all the segments; this directory is then moved entirely to the output directory.
  stagingDir: 's3://<somepath>/'
  dependencyJarDir: 's3://<somePath>/plugins'
```
and that should address the CNFE. This is a known issue; we are debugging why the jars are not getting added to the YARN nodes, so the above can be a short-term workaround in the meantime.
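For the curious, a rough reconstruction of what a dependencyJarDir setting plausibly does inside the Spark runner. This is a hedged sketch, not the verified Pinot source; the helper name addDependencyJars is invented. The idea is that each jar found under the directory is shipped to the executors via SparkContext.addJar:
```
// Hedged reconstruction, not the verified Pinot source: ship every jar found
// under dependencyJarDir to the executors via SparkContext.addJar.
import java.net.URI;
import org.apache.pinot.spi.filesystem.PinotFS;
import org.apache.spark.api.java.JavaSparkContext;

public class DependencyJars {
  // Hypothetical helper; the real runner's logic and naming may differ.
  static void addDependencyJars(JavaSparkContext sc, PinotFS fs, URI depsJarDir) throws Exception {
    for (String path : fs.listFiles(depsJarDir, true)) {
      if (path.endsWith(".jar")) {
        sc.addJar(path);  // makes the jar available on every executor's classpath
      }
    }
  }
}
```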
m
Thanks @Kulbir Nijjer
n
Hi, I tried both of the approaches above: with --jars during spark-submit, and by providing the plugin path in the YAML file. I am still getting the same error:
```
Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
```
c
Yes, +1. I can confirm this parsing and class-not-found issue in a GCP environment as well [for both approaches].
n
Yes
k
@Nisheet That's odd because the same setup works fine on my end. Can you share your spark-submit command?
n
Spark submit command:
```
spark-submit --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand --jars /home/yarn/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-SNAPSHOT-shaded.jar,/home/yarn/pinot/lib/pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar,/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-SNAPSHOT-shaded.jar,/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-SNAPSHOT-shaded.jar --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/home/yarn/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet" dbfs:/databricks/jars/pinot/lib/pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar -jobSpecFile /tmp/spec_file.yaml
```
@Kulbir Nijjer When I run the example provided in the Pinot package locally, I am also getting this error (with the --jars config; I have not tried the YAML dependencyJarDir config for this example):
```
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --jars "${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-SNAPSHOT-shaded.jar" \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/sparkIngestionJobSpec.yaml
```
Error message:
```
Can't construct a java object for tag:yaml.org,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec; exception=Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
 in 'string', line 21, column 1:
    executionFrameworkSpec:
    ^

	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
	at org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
```
k
Hmm, I am able to run a similar command on my local setup, except it's 0.7.1. We can probably debug this tomorrow morning PST. @Nisheet I will ping you around 9:30 PST, and if you're available we can hop on a Zoom to debug.
k
This looks very similar to my issue. When we switched to 0.8 (from 0.7.1), the initial issue was:
```
21/08/26 19:00:12 ERROR command.LaunchDataIngestionJobCommand: Got exception to kick off standalone data ingestion job - 
java.lang.RuntimeException: Failed to create IngestionJobRunner instance for class - org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:139)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:103)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:143)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:74)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at org.apache.pinot.spi.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:77)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:294)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:265)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:246)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:137)
	... 9 more
```
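Worth noting: this trace resolves the runner through Pinot's PluginManager and its PluginClassLoader rather than a plain Class.forName, so where the class must be visible depends on how plugins.dir and the classpath are wired together. A quick probe of that same path (hedged; only the PluginManager.get() accessor and the createInstance call visible in the trace are assumed):
```
// Hedged probe based only on the methods visible in the stack trace: ask
// PluginManager (which also searches plugins.dir) for the runner class,
// instead of a plain Class.forName against the JVM classpath.
import org.apache.pinot.spi.plugin.PluginManager;

public class PluginProbe {
  public static void main(String[] args) throws Exception {
    Object runner = PluginManager.get().createInstance(
        "org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner");
    System.out.println("Loaded: " + runner.getClass());
  }
}
```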
Then we defined HADOOP_CLASSPATH to include a bunch of Pinot jars…
```
export HADOOP_CLASSPATH=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-batch-ingestion/pinot-batch-ingestion-hadoop/pinot-batch-ingestion-hadoop-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar
```
After that it got past the initial job submission phase, but failed in all the Hadoop map tasks with:
```
21/08/26 19:08:35 INFO mapreduce.Job: Task Id : attempt_1629495469244_4781_m_000046_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.HadoopPinotFS
	at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:56)
	at org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentCreationMapper.setup(HadoopSegmentCreationMapper.java:104)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.HadoopPinotFS
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at org.apache.pinot.spi.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:77)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:294)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:265)
	at org.apache.pinot.spi.plugin.PluginManager.createInstance(PluginManager.java:246)
	at org.apache.pinot.spi.filesystem.PinotFSFactory.register(PinotFSFactory.java:51)
	... 9 more
```