# troubleshooting
a
Trying to run the airlineStats offline ingestion example on our internal cluster. Getting this error.
```
Caused by: org.apache.avro.AvroRuntimeException: Not a valid schema field: $ts$WEEK
	at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor.extract(AvroRecordExtractor.java:76)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordReader.next(AvroRecordReader.java:74)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:66)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:37)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:178)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:152)
```
r
• could you share more on the apache-pinot-0.11.0-bin you are using
  ◦ is this a pre-built binary? a docker image?
• how are you deploying it?
  ◦ do you have the complete shell script you use to launch the cluster and ingestion?
a
That is the apache-pinot-0.11.0-bin release of Apache Pinot I downloaded.
This is the example I have been trying to run on my EKS cluster.
Pinot on EKS was installed via the Helm chart.
[screenshot: Screen Shot 2022-10-11 at 3.29.25 PM.png]
The table and schema were created via the script provided in that example.
r
Can you share the link you used to download the binary?
And the EKS Helm chart values.yaml you were using?
a
This is the Helm chart we use, with changes to enable the S3 filesystem:
```
extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      controller.task.scheduler.enabled=true
      controller.data.dir=s3://${data_bucket_name}/controller-data
      controller.local.temp.dir=/tmp/pinot-tmp-data/
      pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.controller.storage.factory.s3.region=us-east-1
      pinot.controller.storage.factory.s3.disableAcl=false
      pinot.controller.segment.fetcher.protocols=file,http,s3
      pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
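(For reference, when segments live in an S3 deep store the server component usually needs matching entries as well. A minimal sketch, assuming the chart exposes a server-side extra.configs block the same way and the same region applies:)
```
server:
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      # matching S3 filesystem and segment-fetcher settings for the servers (assumed layout)
      pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.server.storage.factory.s3.region=us-east-1
      pinot.server.segment.fetcher.protocols=file,http,s3
      pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```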
I just enabled the ingress in the values section; otherwise it is exactly the same.
This is the SparkApplication I use to launch the ingestion job:
```
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: airline-stats-ingest-testing
  namespace: dev
spec:
  type: Java
  mode: cluster
  image: "datamechanics/spark:3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-latest"
  imagePullPolicy: Always
  sparkVersion: 3.2.1
  mainClass: org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand
  mainApplicationFile: s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
  arguments:
    - "-jobSpecFile"
    - "/mnt/config/sparkAirlineStatIngestionJobSpec.yaml"
  deps:
    jars:
      - s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
      - s3://dev-xxx-testing/spark-jars/pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-avro-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-csv-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-parquet-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-s3-0.11.0-shaded.jar
  hadoopConf:
    com.amazonaws.services.s3.enableV4: "true"
    fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
    fs.s3.aws.credentials.provider: "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
  sparkConf:
    spark.kubernetes.namespace: dev
    spark.driver.extraJavaOptions: "-Dplugins.dir=${CLASSPATH} -Dlog4j2.configurationFile=/mnt/config/pinot-ingestion-job-log4j2.xml"
    spark.driver.extraClassPath: "pinot-all-0.11.0-jar-with-dependencies.jar:pinot-avro-0.11.0-shaded.jar:pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar:pinot-csv-0.11.0-shaded.jar:pinot-parquet-0.11.0-shaded.jar:pinot-s3-0.11.0-shaded.jar"
    spark.executor.extraClassPath: "pinot-all-0.11.0-jar-with-dependencies.jar:pinot-avro-0.11.0-shaded.jar:pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar:pinot-csv-0.11.0-shaded.jar:pinot-parquet-0.11.0-shaded.jar:pinot-s3-0.11.0-shaded.jar"
```
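(For context, the referenced sparkAirlineStatIngestionJobSpec.yaml is not shown here. A Spark-based segment-generation job spec for this kind of setup would look roughly like the sketch below; the bucket paths, controller URI, and the spark-3 runner class names are assumptions for illustration, not the actual file.)
```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://your-bucket/airlineStats/rawdata/'      # assumed input location of the Avro files
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 's3://your-bucket/airlineStats/segments/'    # assumed output location for generated segments
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'airlineStats'
pinotClusterSpecs:
  - controllerURI: 'https://pinot.dev.zzzz.io'             # assumed controller endpoint
```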
r
It seems like you are trying to use a 0.11 binary against a "latest" docker image.
You might need to synchronize the version used in the Helm chart with the binary you use to launch ingestion.
a
How do I check that binary?
Does the latest Helm chart version not point to the latest binary?
r
Yes, it does. But you are using apache-pinot-0.11.0-bin and your Helm chart is latest (i.e. current master, which is even newer than apache-pinot-0.12.0-bin).
a
Any pointer on which docker image we can use in the Helm chart, or maybe another version of the Helm chart to sync with?
Gotcha.
r
If you use the latest Helm chart, you should use the latest binary. If you want to use 0.11.0-bin, you need to use the 0.11.0 release tag.
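(For reference, pinning the image in the chart's values.yaml would look roughly like this, assuming the chart's default apachepinot/pinot image repository:)
```
image:
  repository: apachepinot/pinot
  tag: release-0.11.0      # pin to the same release as the ingestion binary instead of "latest"
  pullPolicy: IfNotPresent
```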
a
Let me ask the platform team to switch to that.
Is it in values.yaml?
Found it.
I think we should set it to release-0.11.0.
Thank you for your help, much appreciated.
r
no problem. thank you for sharing the details!
a
@Rong R same error
```
22/10/12 16:42:00 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (10.20.72.20 executor 1): org.apache.avro.AvroRuntimeException: Not a valid schema field: $ts$WEEK
	at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
	at org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor.extract(AvroRecordExtractor.java:76)
```
```
curl -X GET "https://pinot.dev.zzzz.io/version" -H "accept: application/json"
{"pinot-protobuf":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-kafka-2.0":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-avro":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-distribution":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-csv":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-s3":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-yammer":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-segment-uploader-default":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-batch-ingestion-standalone":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-confluent-avro":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-thrift":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-orc":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-azure":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-gcs":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-dropwizard":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-hdfs":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-adls":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-kinesis":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-json":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-minion-builtin-tasks":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-parquet":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033","pinot-segment-writer-file-based":"0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033"}%
```
Using the same versions of the client-side jars:
```
deps:
    jars:
      - s3://dev-xxx-testing/spark-jars/pinot-all-0.11.0-jar-with-dependencies.jar
      - s3://dev-xxx-testing/spark-jars/pinot-batch-ingestion-spark-3.2-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-avro-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-csv-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-parquet-0.11.0-shaded.jar
      - s3://dev-xxx-testing/spark-jars/pinot-s3-0.11.0-shaded.jar
```
r
could you file a github issue for this?
a
Will do.
r
Also, is it fine to try ingesting not via Spark but directly using the local (standalone) ingestion?
a
For trying the example it's absolutely fine, but for the production use case we have Spark pipelines that we would like to integrate with.
r
Yeah, please try it so that we can pinpoint whether this is a Spark plugin issue or a general Avro processing issue.
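(For reference, trying it standalone should only require switching the executionFrameworkSpec section of the job spec to the standalone runners and launching it with pinot-admin.sh LaunchDataIngestionJob; a minimal sketch, with the rest of the spec unchanged:)
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```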
a
Hence I was trying it out this way, because the infra is already set up at our end.
So I think, as per that example, there is no such field $ts$WEEK in the data, and I am assuming it is computed dynamically at table creation time.
```
"fieldConfigList": [
      {
        "name": "ts",
        "encodingType": "DICTIONARY",
        "indexType": "TIMESTAMP",
        "indexTypes": [
          "TIMESTAMP"
        ],
        "timestampConfig": {
          "granularities": [
            "DAY",
            "WEEK",
            "MONTH"
          ]
        }
      }
    ]
```
It's a bit surprising that the Spark Avro reader is looking for that field in the Avro file, though.
OK, confirming: it seems Spark ingestion is the issue, but it works standalone.
r
as I expected! 👍 please file a github issue.
a
Will do. Thanks for your help!
r
Thank you so much! Please include as many details as possible. I will tag the appropriate folks who might have more knowledge :-)