# troubleshooting
a
Trying to run the airlineStats offline ingestion example on our internal cluster. Getting this error.
```
Caused by: org.apache.avro.AvroRuntimeException: Not a valid schema field: $ts$WEEK
	at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor.extract(AvroRecordExtractor.java:76)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordReader.next(AvroRecordReader.java:74)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:66)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:37)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:178)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:152)
```
r
• could you share more on the apache-pinot-0.11.0-bin you are using
  ◦ is this a pre-built binary? a docker image?
• how are you deploying it?
  ◦ do you have the complete shell script you use to launch the cluster and ingestion?
a
That is the apache-pinot-0.11.0-bin release of Apache Pinot I downloaded.
This is the example I have been trying to run on my EKS cluster.
Pinot on EKS was installed via the Helm chart.
[screenshot: Screen Shot 2022-10-11 at 3.29.25 PM.png]
The table and schema were created via the script provided in that example.
r
Can you share the link you used to download the binary?
And the EKS Helm chart values.yaml you were using?
a
This is the Helm chart we use, with changes to enable the S3 filesystem:
```
extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      controller.task.scheduler.enabled=true
      controller.data.dir=s3://${data_bucket_name}/controller-data
      controller.local.temp.dir=/tmp/pinot-tmp-data/
      pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.controller.storage.factory.s3.region=us-east-1
      pinot.controller.storage.factory.s3.disableAcl=false
      pinot.controller.segment.fetcher.protocols=file,http,s3
      pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
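(For reference, when segments live in an S3 deep store the server component usually needs matching entries as well. A minimal sketch, assuming the chart exposes a server-side extra.configs block the same way and the same region applies:)
```
server:
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      # matching S3 filesystem and segment-fetcher settings for the servers (assumed layout)
      pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.server.storage.factory.s3.region=us-east-1
      pinot.server.segment.fetcher.protocols=file,http,s3
      pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```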
I just enabled the ingress in the values section; otherwise it is exactly the same.
This is the SparkApplication I use to launch the ingestion job:
```
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: airline-stats-ingest-testing
  namespace: dev
spec:
  type: Java
  mode: cluster
  image: "datamechanics/spark:3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest"
  imagePullPolicy: Always
  sparkVersion: 3.2.1
  mainClass: org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand
  mainApplicationFile: s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
  arguments:
    - "-jobSpecFile"
    - "/mnt/config/sparkAirlineStatIngestionJobSpec.yaml"
  deps:
    jars:
      - s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
      - s3://dev-xxx-testing/spark-jars/pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-avro-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-csv-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-parquet-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-s3-0.11.0-shaded.jar
  hadoopConf:
    com.amazonaws.services.s3.enableV4: "true"
    fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
    fs.s3.aws.credentials.provider: "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
  sparkConf:
    spark.kubernetes.namespace: dev
    spark.driver.extraJavaOptions: "-Dplugins.dir=${CLASSPATH} -Dlog4j2.configurationFile=/mnt/config/pinot-ingestion-job-log4j2.xml"
    spark.driver.extraClassPath: "pinot-all-0.11.0-jar-with-dependencies.jar:pinot-avro-0.11.0-shaded.jar:pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar:pinot-csv-0.11.0-shaded.jar:pinot-parquet-0.11.0-shaded.jar:pinot-s3-0.11.0-shaded.jar"
    spark.executor.extraClassPath: "pinot-all-0.11.0-jar-with-dependencies.jar:pinot-avro-0.11.0-shaded.jar:pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar:pinot-csv-0.11.0-shaded.jar:pinot-parquet-0.11.0-shaded.jar:pinot-s3-0.11.0-shaded.jar"
```
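(For context, the referenced sparkAirlineStatIngestionJobSpec.yaml is not shown here. A Spark-based segment-generation job spec for this kind of setup would look roughly like the sketch below; the bucket paths, controller URI, and the spark-3 runner class names are assumptions for illustration, not the actual file.)
```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://your-bucket/airlineStats/rawdata/'      # assumed input location of the Avro files
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 's3://your-bucket/airlineStats/segments/'    # assumed output location for generated segments
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'airlineStats'
pinotClusterSpecs:
  - controllerURI: 'https://pinot.dev.zzzz.io'             # assumed controller endpoint
```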
r
It seems like you are trying to use a 0.11 binary against a "latest" docker image.
You might need to synchronize the version used in the Helm chart with the binary you use to launch ingestion.
a
How do I check that binary?
Does the latest Helm chart version not point to the latest binary?
r
Yes, it does. But you are using apache-pinot-0.11.0-bin and your Helm chart is latest (i.e. current master, which is even newer than apache-pinot-0.12.0-bin).
a
Any pointer on which docker image we can use in the Helm chart, or maybe another version of the Helm chart to sync with?
Gotcha.
r
If you use the latest Helm chart, you should use the latest binary. If you want to use 0.11.0-bin, you need to use the 0.11.0 release tag.
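(For reference, pinning the image in the chart's values.yaml would look roughly like this, assuming the chart's default apachepinot/pinot image repository:)
```
image:
  repository: apachepinot/pinot
  tag: release-0.11.0      # pin to the same release as the ingestion binary instead of "latest"
  pullPolicy: IfNotPresent
```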
a
Let me ask the platform team to switch to that.
Is it in values.yaml?
Found it.
I think we should set it to release-0.11.0.
Thank you for your help, much appreciated.
r
no problem. thank you for sharing the details!
a
@Rong R same error
```
22/10/12 16:42:00 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (10.20.72.20 executor 1): org.apache.avro.AvroRuntimeException: Not a valid schema field: $ts$WEEK
	at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor.extract(AvroRecordExtractor.java:76)
```
```
curl -X GET "https://pinot.dev.zzzz.io/version" -H "accept: application/json"
{"pinot-protobuf":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-kafka-2.0":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-avro":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-distribution":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-csv":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-s3":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-yammer":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-segment-uploader-default":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-batch-ingestion-standalone":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-confluent-avro":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-thrift":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-orc":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-azure":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-gcs":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-dropwizard":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-hdfs":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-adls":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-kinesis":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-json":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-minion-builtin-tasks":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-parquet":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-segment-writer-file-based":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033"}%
```
Using the same versions of the client-side jars:
```
deps:
    jars:
      - s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
      - s3://dev-xxx-testing/spark-jars/pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-avro-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-csv-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-parquet-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-s3-0.11.0-shaded.jar
```
r
could you file a github issue for this?
a
Will do.
r
Also, is it fine to try ingesting not via Spark but directly using the local (standalone) ingestion?
a
For trying the example it's absolutely fine, but for the production use case we have Spark pipelines that we would like to integrate with.
r
Yeah, please try it so that we can pinpoint whether this is a Spark plugin issue or a general Avro processing issue.
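(For reference, trying it standalone should only require switching the executionFrameworkSpec section of the job spec to the standalone runners and launching it with pinot-admin.sh LaunchDataIngestionJob; a minimal sketch, with the rest of the spec unchanged:)
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```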
a
Hence I was trying it out this way, because the infra is already set up at our end.
So I think, as per that example, there is no such field $ts$WEEK in the data, and I am assuming it is computed dynamically at table creation time.
```
"fieldConfigList": [
      {
        "name": "ts",
        "encodingType": "DICTIONARY",
        "indexType": "TIMESTAMP",
        "indexTypes": [
          "TIMESTAMP"
        ],
        "timestampConfig": {
          "granularities": [
            "DAY",
            "WEEK",
            "MONTH"
          ]
        }
      }
    ]
```
It's a bit surprising that the Spark Avro reader is looking for that field in the Avro file, though.
OK, confirming: it seems Spark ingestion is the issue, but it works standalone.
r
as I expected! 👍 please file a github issue.
a
Will do. Thanks for your help!
r
Thank you so much! Please include as many details as possible. I will tag the appropriate folks who might have more knowledge :-)