# Troubleshooting
x
i encountered this issue when trying the spark ingestion:
```
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:93)
        at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:370)
        at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:311)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:359)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:448)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:125)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2611)
        at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
        at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:198)
        at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142)
        at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113)
        at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
        at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:67)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
        at org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(SystemUtils.java:1626)
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        ... 27 more
```
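This NPE in `SystemUtils.isJavaVersionAtLeast` is the usual symptom of an older commons-lang3 shadowing Spark's own copy: releases that predate the newer JDKs can't map the running Java version to their `JavaVersion` enum, and Spark's `StorageUtils` initializer trips over the resulting null. A quick way to see which commons-lang3 copies are in play; the paths are illustrative, following the spark-submit command later in the thread:
```
# Spark's own commons-lang3, bundled with the distribution:
ls ${SPARK_HOME}/jars/commons-lang3-*.jar

# Check whether the shaded Pinot jar bundles a second (possibly older) copy:
unzip -l /opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar \
  | grep 'org/apache/commons/lang3/SystemUtils'
```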
m
What version of Java are you using?
x
```
spark 3.0.2
pinot 0.7.1

java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment 18.9 (build 11.0.10+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.10+9, mixed mode, sharing)
```
i get the jars for spark-submit from here: https://downloads.apache.org/pinot/apache-pinot-incubating-0.7.1/
```
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/opt/pinot/plugins -Dlog4j2.configurationFile=/opt/pinot/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar:/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar:/opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar:/opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar" \
  --jars local:///opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.7.1-shaded.jar,local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar,local:///opt/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.7.1-shaded.jar,local:///opt/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.7.1-shaded.jar \
  local:///opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar -jobSpecFile jobSpec.yaml | tee output
```
m
We are seeing some issues with newer Spark versions, could you try Spark 2.3.x?
x
i can’t see that thread, i think it’s buried due to the 10k message limit
let me try some workarounds
b
Just upgrade Apache Commons Lang3 to the latest version in your deployment.
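One hedged way to wire that into spark-submit is to put a newer commons-lang3 ahead of the shaded Pinot jar, since `spark.driver.extraClassPath` entries are prepended to the driver classpath (the jar location and version below are illustrative, not from this thread):
```
# Illustrative: a standalone commons-lang3 3.12.0 placed first on the driver
# classpath wins over any older copy shaded into the Pinot jars.
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --conf "spark.driver.extraClassPath=/opt/libs/commons-lang3-3.12.0.jar:/opt/pinot/lib/pinot-all-0.7.1-jar-with-dependencies.jar" \
  ...
```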
m
Thanks @Bruce Ritchie
x
new errors:
```
libraries used:
spark 2.4.7
apache-commons-lang3 3.12.0 and apache-pinot 0.8.0 (mvn build with jdk 11)

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/pinot/tools/admin/command/LaunchDataIngestionJobCommand has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
```
class file version 55.0 is Java 11 and 52.0 is Java 8, so jdk 11 seems to be supported for spark 3 onwards only?
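For what it's worth, `javap` (shipped with the JDK) can confirm that reading directly from the jar; a minimal sketch, assuming the shaded jar keeps the naming pattern from earlier in the thread:
```
# Prints "major version: 55" for a Java 11 build, "major version: 52" for Java 8.
javap -verbose \
  -classpath pinot-all-0.8.0-jar-with-dependencies.jar \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  | grep 'major version'
```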
```
libraries used:
spark 3.0.2
apache-commons-lang3 3.12.0 and apache-pinot 0.8.0 (mvn build with jdk 11)

# Observe the following logs happening over and over

INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 100.64.233.164, 21000, None)
INFO BlockManager: Reporting 0 blocks to the master.
INFO Executor: Told to re-register on heartbeat
INFO BlockManager: BlockManager BlockManagerId(1, 100.64.233.164, 21000, None) re-registering with master
...
```
shall try building 0.8.0 with jdk 8 and running with my spark 2.4.7 image :3
tried building from source for spark 2.4.7, was unable to because of a missing 2.4.7 artifact for https://repo.maven.apache.org/maven2/com/holdenkarau/spark-testing-base_2.11/
tried building from source for spark 2.4.5, ended up with this error when running with spark-submit cluster mode on k8s:
```
# build from source
mvn clean install -DskipTests -Pbin-dist -T 4 -Djdk.version=8 -Dhadoop.version=3.1.0 -Dspark.version=2.4.5

# run in a spark image with hadoop 3.1 and spark 2.4.5
${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=..." \
  --conf "spark.driver.extraClassPath=..." \
  --jars ... \
  ... -jobSpecFile jobSpec.yaml
```
im out of ideas. will probably try standalone batch ingestion even if it's slower, since data freshness is not a concern for me
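For reference, the standalone path runs the same job spec through the Pinot admin tool with no Spark involved; a minimal sketch, assuming the jobSpec.yaml's executionFrameworkSpec is switched to standalone:
```
# Run the ingestion job directly on one machine via the Pinot admin script.
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile jobSpec.yaml
```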
i think it would help if there was a tested docker image with the spark 2.4.0 + hadoop 2.7.0 dependencies provided for me to run my ingestion job
@Xiang Fu was wondering which docker image i could use to run the spark batch ingestion job? i see this but it's no longer available: https://github.com/apache/pinot/pull/4975/files#diff-eb034a8230fa96f6fa24ff5c173626f42eba8d39ca1a1ee2cecd9274e2d8dee8R182
b
As far as JDK 11 with Spark goes, yes, I believe it's Spark 3+ only.
For me the only way to ingest was via kafka.
x
well that's discouraging to hear 😔
m
@Xiang Fu any pointers here ^^
b
I'm sure it can be made to work with spark 2.x; I'm just using spark 3/jdk 11 in our environment and can't backport.
x
I haven’t tried spark 3
might be worth trying to upgrade the spark lib and seeing what the changes are; if they're huge, we can have a new module for spark 3
I don’t think the client docker image works with the spark mode; you need to build the jars and put them into the spark cluster
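A hedged sketch of that, with illustrative paths: copy the shaded Pinot jars into the `jars/` directory of the Spark distribution (or bake them into the Spark docker image) so every node sees them:
```
# Illustrative: make the Pinot jars part of the Spark distribution itself
# instead of shipping them per-job.
cp /opt/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar "${SPARK_HOME}/jars/"
cp /opt/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-shaded.jar "${SPARK_HOME}/jars/"
```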
x
Think I will give backporting spark 3 a shot today