# troubleshooting
e
I am working on the airlineStats example with Pinot 0.11.0 and trying to run a Spark 3.2 ingestion job. The default example works, but when I change inputDirURI to ADLS instead of the local file system and change the pinotFSSpecs scheme, I start getting this error:
```
Caused by: java.lang.IllegalStateException: PinotFS for scheme: abfs has not been initialized
```
This is the Spark command I am running:
```shell
spark-submit \
--class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
--master local \
--deploy-mode client \
--conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins" \
--conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-adls/pinot-adls-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar" \
--conf "spark.executor.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-adls/pinot-adls-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar" \
local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/SparkIngestionJob.yaml
```
SparkIngestionJob.yaml:
```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'

  extraConfigs:
    stagingDir: examples/batch/airlineStats/staging

jobType: SegmentCreationAndTarPush

inputDirURI: 'abfs://fs@accountname/...'
includeFileNamePattern: 'glob:**/*.avro'

outputDirURI: 'examples/batch/airlineStats/segments'

overwriteOutput: true

pinotFSSpecs:
  - scheme: adl2
    className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
    configs:
      accountName: '..'
      accessKey: '..'
      fileSystemName: '..'

recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'

tableSpec:
  tableName: 'airlineStats'
  schemaURI: 'http://20.207.206.121:9000/tables/airlineStats/schema'
  tableConfigURI: 'http://20.207.206.121:9000/tables/airlineStats'

segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'airlineStats_batch'
    exclude.sequence.id: true

pinotClusterSpecs:
  - controllerURI: 'http://20.207.206.121:9000'

pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
I am also attaching my values.yml file, which is used to deploy Pinot with Helm.
I fixed it by changing abfs://fs@accountname/... to adl2://fs@accountname/ in the SparkIngestionJob.yaml file.
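In other words, the scheme of inputDirURI has to match one of the schemes registered under pinotFSSpecs, otherwise the PinotFS lookup fails. A minimal sketch of the two fields aligned (the filesystem and account names here are placeholders, not real values):

```yaml
# The URI scheme used here...
inputDirURI: 'adl2://fs@accountname/...'

pinotFSSpecs:
  # ...must match a scheme registered here; any other scheme fails with
  # "PinotFS for scheme: ... has not been initialized".
  - scheme: adl2
    className: org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS
```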
m
Yea, thanks
n
mm.. what is the difference between abfs and adl2?
e
To my understanding, there is no difference; it's just that adl2 is the scheme defined in the sparkIngestionJobSpec.yaml file.
m
ADLS is Azure Data Lake Storage; ABS is Azure Blob Storage. ADLS internally uses ABS, and we decided to support ADLS for deep store for the guarantees it provides.
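The error in this thread follows from how a scheme-keyed filesystem registry behaves: only schemes listed in pinotFSSpecs get initialized, so a URI with any other scheme cannot be resolved. This is not Pinot's actual implementation, just a hypothetical sketch of that lookup pattern (class and method names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a registry mapping URI schemes to filesystem
// implementations, mirroring the behavior behind the error message
// "PinotFS for scheme: abfs has not been initialized".
public class SchemeRegistrySketch {
  private static final Map<String, String> REGISTRY = new HashMap<>();

  // Called once per entry in pinotFSSpecs when the job starts.
  public static void register(String scheme, String className) {
    REGISTRY.put(scheme, className);
  }

  // Called when a URI such as adl2://fs@accountname/... is resolved;
  // an unregistered scheme (e.g. abfs when only adl2 was configured) fails.
  public static String create(String scheme) {
    if (!REGISTRY.containsKey(scheme)) {
      throw new IllegalStateException(
          "PinotFS for scheme: " + scheme + " has not been initialized");
    }
    return REGISTRY.get(scheme);
  }

  public static void main(String[] args) {
    register("adl2", "org.apache.pinot.plugin.filesystem.ADLSGen2PinotFS");
    System.out.println(create("adl2")); // resolves fine
    try {
      create("abfs"); // not registered: same failure as in the thread
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Renaming the scheme in inputDirURI (or, equivalently, registering the same class under both schemes) makes the lookup succeed.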