# troubleshooting
a
Hi Pinot team, I am facing dependency conflict issues while running Spark ingestion on both Spark 2.4 and 3.2 (with a different error on each). I am using the latest Pinot version (0.11.0-SNAPSHOT). Error on Spark 2.4:
```
Caused by: java.lang.NoSuchMethodError: 'org.apache.pinot.shaded.org.apache.commons.configuration.PropertiesConfiguration org.apache.pinot.spi.env.CommonsConfigurationUtils.fromFile(java.io.File)'
	at org.apache.pinot.segment.spi.index.metadata.SegmentMetadataImpl.getPropertiesConfiguration(SegmentMetadataImpl.java:161)
```
Error on Spark 3.2:
```
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.NullPointerException
	at org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(SystemUtils.java:1626)
```
Both are on JDK 11, on AWS EMR. I think it's this particular combination (Pinot, Spark, S3, Parquet) that's not working. I am trying to remove some of them to narrow down the problem. Just wanted to know if this has worked for anyone.
This is the ingestion config:
```shell
cat > /tmp/spark_batch_job_spec.yml << EOF
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://pinot-ec2-poc/source_data/small/mytable/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://pinot-ec2-poc/data/small/mytable'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: us-west-2
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'mytable'
  schemaURI: 'http://<controller>:9000/tables/mytable/schema'
  tableConfigURI: 'http://<controller>:9000/tables/mytable'
pinotClusterSpecs:
  - controllerURI: 'http://<controller>:9000'
EOF
```
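For reference, the jobType in the spec above drives which of the runner classes in executionFrameworkSpec actually get exercised. A rough sketch of that mapping (my own reading based on the key names in the spec, not a Pinot API):

```python
def required_runners(job_type: str) -> list[str]:
    """Rough mapping from a Pinot batch jobType to the executionFrameworkSpec
    runner keys it exercises (a sketch inferred from the key names, not
    taken from Pinot source)."""
    mapping = {
        "SegmentCreation": ["segmentGenerationJobRunnerClassName"],
        "SegmentTarPush": ["segmentTarPushJobRunnerClassName"],
        "SegmentUriPush": ["segmentUriPushJobRunnerClassName"],
        "SegmentMetadataPush": ["segmentMetadataPushJobRunnerClassName"],
        "SegmentCreationAndTarPush": [
            "segmentGenerationJobRunnerClassName",
            "segmentTarPushJobRunnerClassName",
        ],
    }
    return mapping[job_type]

# SegmentCreationAndTarPush needs both the generation and tar-push runners.
print(required_runners("SegmentCreationAndTarPush"))
```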
and the spark-submit command:
```shell
sudo spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --deploy-mode client \
  --jars "/opt/pinot/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar,/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar,/opt/pinot/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar,/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j2.configurationFile=/opt/pinot/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=/opt/pinot/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/opt/pinot/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
  --conf "spark.executor.extraClassPath=/opt/pinot/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar:/opt/pinot/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar" \
  /opt/pinot/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar -jobSpecFile /tmp/spark_batch_job_spec.yml
```
m
@Kartik Khare ^^
k
Hi, I guess it is picking up an old commons-lang from EMR, since the Pinot one is shaded. What you can do is add the commons-lang3-3.11 jar from https://repo1.maven.org/maven2/org/apache/commons/commons-lang3/3.11/ to the classpath, as well as to the --jars argument of spark-submit. The commons-lang3 version currently being picked up doesn't support Java 11 in its enum, and hence leads to the error.
Another option is to use Spark 2.4 with Java 8.
m
Can we document the problem and workaround, @Kartik Khare?
k
Faced this issue for the first time. Will document it.
m
Thanks. Given Spark-based jobs always run into similar issues, let's keep a running doc to capture all problems and solutions.
m
🙏🙏❤️
a
Hey @Kartik Khare, I finally continued testing Spark ingestion on EMR. Currently focusing on Spark 2.4 on EMR, JDK 11, and the pinot-0.11.0-SNAPSHOT distribution. It now fails with a java.lang.NullPointerException on:
```
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner
```
I rechecked everything; the confs, jars, etc. look fine. Any idea what could be causing this?
k
Can you send me the full stack trace?