
Phúc Huỳnh

04/20/2021, 4:40 AM
Hi, I got this issue when submitting a Spark job to ingest a batch file.
Copy code
21/04/20 03:03:42 ERROR org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand: Got exception to kick off standalone data ingestion job -
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.nio.file.FileSystemNotFoundException: Provider "gs" not installed
	at java.nio.file.Paths.get(Paths.java:147)
	at org.apache.pinot.plugin.filesystem.GcsPinotFS.copy(GcsPinotFS.java:262)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:344)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142)
	... 15 more
After some deep-diving, I think it's because the NIO filesystem on the Spark classpath only has 2 providers: file and jar.
Based on the explanation and discussion here: https://stackoverflow.com/questions/39500445/filesystem-provider-disappearing-in-spark I suggest moving from
java.nio.file.Path
to
org.apache.hadoop.fs.Path
Any ideas?
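For illustration, a minimal sketch (class name hypothetical) of what the GcsPinotFS copy runs into: java.nio.file.Paths.get(URI) looks up an installed NIO FileSystemProvider for the URI scheme, and if google-cloud-nio is not visible to the classloader running the job, only the built-in file and jar providers are found:
Copy code
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;

public class GsProviderCheck {
  public static void main(String[] args) {
    // List the NIO providers visible to this classloader (typically just "file" and "jar"
    // under spark-submit when google-cloud-nio is not on the application classpath).
    FileSystemProvider.installedProviders()
        .forEach(p -> System.out.println("installed provider: " + p.getScheme()));
    try {
      // Paths.get(URI) resolves the provider by scheme; without a registered "gs" provider
      // it throws FileSystemNotFoundException: Provider "gs" not installed.
      Paths.get(URI.create("gs://some-bucket/some/object"));
    } catch (FileSystemNotFoundException e) {
      System.out.println("no provider for gs: " + e.getMessage());
    }
  }
}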

Jackie

04/20/2021, 5:20 AM
@Xiang Fu ^^

Xiang Fu

04/20/2021, 6:48 AM
can you try to add
Copy code
google-cloud-nio
dependency to the pinot-gcs pom file?
hmm, seems there is already one in pom:
Copy code
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-nio</artifactId>
  <version>0.120.0-alpha</version>
</dependency>

Phúc Huỳnh

04/20/2021, 6:52 AM
yup, the pom already has the dependency
maybe we can modify this code to directly use the Google Cloud Storage API
does this code work in your standalone application
or test
but only fail on Spark?

Phúc Huỳnh

04/20/2021, 7:07 AM
yup.
this code works in the standalone application
but fails on the Spark cluster on Dataproc

Xiang Fu

04/20/2021, 7:08 AM
hmm
on the Spark cluster, can you try putting the GCS plugin on the classpath?

Phúc Huỳnh

04/20/2021, 7:09 AM
already tried. 😄

Xiang Fu

04/20/2021, 7:09 AM
I assume this is java 8?
or 11?

Phúc Huỳnh

04/20/2021, 7:09 AM
java 8

Xiang Fu

04/20/2021, 7:13 AM
another way is to shade pinot-ingestion-spark and pinot-gcs together into one jar
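That matches the StackOverflow thread linked above: when jars are merged without combining META-INF/services, the NIO provider registration from google-cloud-nio gets lost. A sketch (plugin configuration only, coordinates omitted) of how such a shaded module could be built with the Maven shade plugin; the important part is ServicesResourceTransformer, which merges the service files so the gs FileSystemProvider survives shading:
Copy code
<!-- Sketch only: shade configuration for a combined spark + gcs jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services entries (e.g. java.nio.file.spi.FileSystemProvider)
               instead of letting one jar's copy overwrite another's. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>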

Phúc Huỳnh

04/20/2021, 7:18 AM
😄 another way to fix it: remove
java.nio.file.Paths
in pinot-gcs, then write a custom Paths.get function?

Xiang Fu

04/20/2021, 7:21 AM
right
then you need to implement the relativize function there
I feel it's better to just use the native Google GCS lib
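For reference, a minimal sketch (bucket and object names are placeholders) of doing the copy with the google-cloud-storage client directly, which sidesteps java.nio.file.Paths and the NIO provider lookup entirely:
Copy code
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class GcsCopySketch {
  public static void main(String[] args) {
    // Uses application-default credentials; bucket/object names are placeholders.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobId source = BlobId.of("source-bucket", "staging/segment.tar.gz");
    BlobId target = BlobId.of("target-bucket", "data/segment.tar.gz");
    // Server-side copy within GCS; no NIO FileSystemProvider involved.
    storage.copy(Storage.CopyRequest.newBuilder().setSource(source).setTarget(target).build())
        .getResult();
  }
}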

Phúc Huỳnh

04/20/2021, 10:38 AM
well, just another error, this time in the sendSegmentUri phase.
Copy code
Caused by: java.lang.IllegalStateException: PinotFS for scheme: gs has not been initialized
	at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518)
	at org.apache.pinot.spi.filesystem.PinotFSFactory.create(PinotFSFactory.java:80)
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:158)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:122)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:117)
do you have any idea how to fix it?

Xiang Fu

04/20/2021, 5:21 PM
It's missing the fs init in the code: https://github.com/apache/incubator-pinot/pull/6819
I merged the above PR, can you try it again?

Phúc Huỳnh

04/22/2021, 5:11 AM
hmm, I tried it again but there are still errors.
I guess I already found the issue: I forgot to add the plugins.dir property in the Spark job. But I found another issue when packing plugins to the worker nodes.
Copy code
21/04/22 07:25:48 ERROR org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner: Failed to tar plugins directory
java.io.IOException: Request to write '4096' bytes exceeds size in header of '12453302' bytes for entry './pinot-plugins.tar.gz'
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:449)

Xiang Fu

04/22/2021, 5:25 PM
hmm, seems like an issue with tar/gzipping everything in the plugin dir into a file
I will take a look
are you using the default plugin dir or have you put more files into it?
do you set this:
plugins.dir
? in your java cmd?
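e.g. something along these lines in the driver JVM options (the path is a placeholder):
Copy code
-Dplugins.dir=/path/to/apache-pinot/plugins -Dplugins.include=pinot-gcs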
I was testing this util locally:
Copy code
public static void main(String[] args) {
    try {
      // Tar + gzip the entire local plugins directory into a single archive file.
      TarGzCompressionUtils.createTarGzFile(
          new File("/Users/xiangfu/workspace/pinot-dev/pinot-distribution/target/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/apache-pinot-incubating-0.8.0-SNAPSHOT-bin/plugins"),
          new File("/tmp/plugin.tar.gz"));
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
it works and I cannot reproduce the issue:

Phúc Huỳnh

04/23/2021, 2:36 AM
I'm pretty sure I added the plugins.dir property; the log has this INFO:
Copy code
21/04/23 02:33:34 INFO org.apache.pinot.spi.plugin.PluginManager: Plugins root dir is [./]
21/04/23 02:33:34 INFO org.apache.pinot.spi.plugin.PluginManager: Trying to load plugins: [[pinot-gcs]]
Full log:
Copy code
:: retrieving :: org.apache.spark#spark-submit-parent-adf0fd1c-d000-4782-8499-d41f1396e726
	confs: [default]
	0 artifacts copied, 9 already retrieved (0kB/17ms)
21/04/23 02:33:34 INFO org.apache.pinot.spi.plugin.PluginManager: Plugins root dir is [./]
21/04/23 02:33:34 INFO org.apache.pinot.spi.plugin.PluginManager: Trying to load plugins: [[pinot-gcs]]
21/04/23 02:33:35 INFO org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher: SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
authToken: null
cleanUpOutputDir: false
excludeFileNamePattern: null
executionFrameworkSpec:
  extraConfigs: {stagingDir: 'gs://bucket_name/tmp/'}
  name: spark
  segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
  segmentMetadataPushJobRunnerClassName: null
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner
failOnEmptySegment: false
includeFileNamePattern: glob:**/*.avro
inputDirURI: gs://bucket_name/rule_logs/
jobType: SegmentCreationAndUriPush
outputDirURI: gs://bucket_name/data/
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://localhost:8080'}
pinotFSSpecs:
- {className: org.apache.pinot.plugin.filesystem.GcsPinotFS, configs: null, scheme: gs}
pushJobSpec: {pushAttempts: 2, pushParallelism: 2, pushRetryIntervalMillis: 1000,
  segmentUriPrefix: null, segmentUriSuffix: null}
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.avro.AvroRecordReader,
  configClassName: null, configs: null, dataFormat: avro}
segmentCreationJobParallelism: 0
segmentNameGeneratorSpec:
  configs: {segment.name.prefix: rule_logs_uat, exclude.sequence.id: 'true'}
  type: simple
tableSpec: {schemaURI: 'http://localhost:8080/tables/RuleLogsUAT/schema',
  tableConfigURI: 'http://localhost:8080/tables/RuleLogsUAT', tableName: RuleLogsUAT}
tlsSpec: null
when I remove
-Dplugins.include=pinot-gcs
I find other jar files. Maybe that's the root cause of the issue?

Xiang Fu

04/23/2021, 3:50 AM
for the plugins root dir, can you try an absolute path?

Phúc Huỳnh

04/23/2021, 3:55 AM
let me try.
hmm, because the Spark context working dir is based on the context_id
Copy code
/tmp/2694644d46744db78cbe27e6dd833f2a
so it's hard to get an absolute path

Xiang Fu

04/23/2021, 6:57 AM
hmm, does
$(pwd)/plugins
work?

Phúc Huỳnh

04/23/2021, 7:00 AM
$(pwd)/plugins
will be the current dir on the remote control machine, not the worker executor machine

Xiang Fu

04/23/2021, 7:07 AM
hmm
the ingestion job only tars/gzips the plugin dir on the driver and sets it on the Spark context
the workers will untar the plugin dir from the context
this is how we package the plugin dir and add it to the sparkContext
then each worker will untar the plugin dir from the tar.gz file
you can check the class:
org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
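Roughly, the mechanism looks like this (illustrative only, not the exact Pinot code; the archive path is a placeholder):
Copy code
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class ShipPluginsSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ship-plugins-sketch").setMaster("local[1]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Driver side: register the (placeholder) plugins archive so Spark ships it to executors.
      sc.addFile("/tmp/pinot-plugins.tar.gz");
      sc.parallelize(Arrays.asList(1, 2)).foreach(i -> {
        // Executor side: resolve the local copy shipped by Spark; the job runner then
        // untars it into the working dir before loading the plugins.
        String localArchive = SparkFiles.get("pinot-plugins.tar.gz");
        System.out.println("plugins archive on executor: " + localArchive);
      });
    }
  }
}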

Phúc Huỳnh

04/23/2021, 7:22 AM
hmm, I will try another option: cluster initialization-actions -> gsutil cp the plugins dir to a folder on the cluster.

Xiang Fu

04/23/2021, 7:38 AM
I see, so that means the Spark driver has no access to the plugin dir?
or will you download the plugin dir?

Phúc Huỳnh

04/23/2021, 7:51 AM
I will pre-download the plugins dir into a relative folder
all steps are almost done, but the final step, SegmentUriPushJob, errors out (left tab: Spark job logging, right tab: pinot-controller logging)
PinotFS for scheme: gs has not been initialized
again 😞
oh, I found the issue. I loaded the jar
pinot-batch-ingestion-spark
from the release branch

Xiang Fu

04/23/2021, 8:36 AM
so the filesystem is not initialized yet
I think you found the issue

Phúc Huỳnh

04/24/2021, 2:32 AM
hmmm. One more issue. The API
v2/segments
seems to conflict with the Spark sendSegmentUris. Something null makes the API return internal server errors.
Logging in the Spark job request:
Copy code
Start sending table RuleLogsUAT segment URIs: [gs://{bucket}/data/year=2020/RuleLogsUAT_OFFLINE_18316_18627_0.tar.gz] to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@499782c3]

Sending table RuleLogsUAT segment URI: gs://{bucket}/data/year=2020/RuleLogsUAT_OFFLINE_18316_18627_0.tar.gz to location: https://{domain} for 

Sending request: https://{domain}/v2/segments to controller: pinot-controller-0.pinot-controller-headless.analytics.svc.cluster.local, version: Unknown

Caught temporary exception while pushing table: RuleLogsUAT segment uri: gs://{bucket}/data/year=2020/RuleLogsUAT_OFFLINE_18316_18627_0.tar.gz to https://{domain}, will retry

Got error status code: 500 (Internal Server Error) with reason: "Caught internal server exception while uploading segment" while sending request: https://{domain}/v2/segments to controller: pinot-controller-0.pinot-controller-headless.analytics.svc.cluster.local, version: Unknown
	at org.apache.pinot.common.utils.FileUploadDownloadClient.sendRequest(FileUploadDownloadClient.java:451)
	at org.apache.pinot.common.utils.FileUploadDownloadClient.sendSegmentUri(FileUploadDownloadClient.java:771)
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.lambda$sendSegmentUris$1(SegmentPushUtils.java:178)
	at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50)
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:175)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:127)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:117)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
As far as I can tell, maybe the request headers are null, so the default uploadType is
SEGMENT
-> exception when the multipart file is null

Xiang Fu

04/24/2021, 8:22 AM
hmm
for your Pinot controller/server, do they have those Pinot filesystems configured?
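For reference, a rough sketch of the GCS filesystem config on the controller side (key names per the Pinot GCS docs; values are placeholders, and the server side takes the analogous pinot.server.* keys):
Copy code
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=my-gcp-project
pinot.controller.storage.factory.gs.gcpKey=/path/to/service-account-key.json
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher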
also, can you try segmentMetadataPush
Copy code
segmentMetadataPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner
jobType: SegmentCreationAndMetadataPush
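In the job spec above, that would look roughly like:
Copy code
executionFrameworkSpec:
  name: spark
  segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
  segmentMetadataPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner
jobType: SegmentCreationAndMetadataPush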

Phúc Huỳnh

04/26/2021, 2:48 AM
well, I think I have found the problem. In the nginx-ingress log, I found:
Copy code
4: *519 client sent invalid header line: "DOWNLOAD_URI: gs://my-bucket-test-data-np/data/RuleLogsUAT_OFFLINE_18117_18731_0.tar.gz" while reading client request headers, client: 10.255.160.94,
Copy code
When the use of underscores is disabled, request header fields whose names contain underscores are marked as invalid and become subject to the ignore_invalid_headers directive.
I think nginx ingress removes all headers whose names are considered invalid

Xiang Fu

04/26/2021, 9:03 AM
oic
then you need to change nginx to keep those headers
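For example, with plain nginx the underscores_in_headers directive allows header names containing underscores, and with kubernetes ingress-nginx the same thing is exposed as a ConfigMap option (option names per the nginx / ingress-nginx docs):
Copy code
# nginx.conf (http or server block)
underscores_in_headers on;
Copy code
# ingress-nginx controller ConfigMap
data:
  enable-underscores-in-headers: "true"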