# troubleshooting
f
Hi All 👋, I'm trying to configure a date format like this "_2020-12-31T19:59:21.522-0400_" and created _table-schema.json_ as:
```json
"dateTimeFieldSpecs": [{
    "name": "timestampCustom",
    "dataType": "STRING",
    "format" : "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSZZ",
    "granularity": "1:MILLISECONDS"
  }]
```
The table is created successfully, but the POST command returns:
```json
{
  "code": 500,
  "error": "Caught exception when ingesting file into table: foo_OFFLINE. null"
}
```
I discovered it's related to the date format; could you kindly indicate what it should be? I used this site to generate the custom format. Thanks in advance!
x
looks like you had a typo. maybe try `SSSZ` instead, as noted on the website?
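If you want to sanity-check the pattern outside Pinot, here's a minimal sketch with plain `java.text.SimpleDateFormat` (Pinot's SIMPLE_DATE_FORMAT handling is Joda-Time based, but `SSS` and `Z` mean the same three-digit millis and `-0400`-style offset in both):
```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class PatternCheck {
  public static void main(String[] args) throws ParseException {
    // "SSS" = 3-digit milliseconds, "Z" = RFC-822 offset such as -0400.
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
    long millis = sdf.parse("2020-12-31T19:59:21.522-0400").getTime();
    System.out.println(millis); // prints the epoch millis if the pattern matches
  }
}
```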
m
@User ^^
d
This looks like the date format we are using. This is the config we've set up:
```json
{
  "name": "deletedAt",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "granularity": "1:MILLISECONDS"
}
```
f
Still failing 😕 This is what I have:
• _table-schema.json_
```json
{
  "schemaName": "ads13",
  "dimensionFieldSpecs": [
    {
      "name": "id",
      "dataType": "INT"
    },
    {
      "name": "value",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampCustom",
      "dataType": "STRING",
      "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```
• _table-config.json_
```json
{
  "tableName": "ads13",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "timeColumnName": "timestampCustom",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": 365
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  },
  "metadata": {}
}
```
• _data.csv_
```
id,value,timestampCustom
1,foo,2020-12-31T19:59:21.522-0400
```
And then I run:
```bash
/opt/pinot/bin/pinot-admin.sh AddTable -tableConfigFile table-config.json -schemaFile table-schema.json -exec

curl -X POST -F file=@data.csv -H "Content-Type: multipart/form-data" "http://localhost:9000/ingestFromFile?tableNameWithType=ads13_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22csv%22%7D"
```
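(For reference, the `batchConfigMapStr` value `%7B%22inputFormat%22%3A%22csv%22%7D` is just the URL-encoded form of `{"inputFormat":"csv"}`.)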
m
Any logs around what step failed?
f
UPDATE: I discovered that the issue comes from the time format (`HH:mm:ss`); the colon character is what breaks it. For example, this format works: `1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HHmm` (colons removed) with this _data.csv_:
```
id,value,timestampCustom
1,foo,2020-12-31T1959
```
But this format does not work: `1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm` (colon added after the hour) with this _data.csv_:
```
id,value,timestampCustom
1,foo,2020-12-31T19:59
```
What do you recommend? Thanks in advance!
m
Cc @User if you can help document
e
Hello guys, I'm working with Facundo on the same test. Question: will the behavior be the same if we use an `ingestionJobSpec.yaml` instead of the API? Is it worth trying?
x
I tried to reproduce the issue on my end with the setup above. I see a failure in the controller logs during segment name generation.
```
java.lang.IllegalArgumentException: null
        at shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:108) ~[startree-pinot-all-0.10.0-ST.36-jar-with-dependencies.jar:0.10.0-ST.36-565e66063a82d0b4a61c73bfcddbbb3cd0d436ac]
        at org.apache.pinot.segment.spi.creator.name.SimpleSegmentNameGenerator.generateSegmentName(SimpleSegmentNameGenerator.java:53) ~[startree-pinot-all-0.10.0-ST.36-jar-with-dependencies.jar:0.10.0-ST.36-565e66063a82d0b4a61c73bfcddbbb3cd0d436ac]
```
The `ingestFromFile` endpoint is hard-coded to use the 'simple' segment name generator (it was added mainly for testing purposes), and the simple generator doesn't work with a date-formatted time column. For date-formatted time columns, it's recommended to use the 'normalizedDate' segment name generator type (docs). @User when using an ingestion job, the generator type can be configured to 'normalizedDate' (docs), hopefully overcoming this issue.
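For the ingestion job route, the relevant piece of `ingestionJobSpec.yaml` would look something like this (a sketch following the batch job spec layout in the docs; the prefix and flag values are illustrative):
```yaml
# Fragment of ingestionJobSpec.yaml -- segment name generator section only.
segmentNameGeneratorSpec:
  type: normalizedDate              # handles date-formatted time columns
  configs:
    segment.name.prefix: 'ads13'    # illustrative prefix
    exclude.sequence.id: false      # illustrative; keep or drop sequence ids
```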
e
thanks @User. After running the ingestion using Spark, we're facing the error described in this thread. Pinot is deployed using Helm in k8s; the chart installs `PINOT_VERSION=0.10.0-SNAPSHOT`. Is it safe to download the 0.8.0 jars and try again?
After re-installing 0.8.0 in k8s we're able to move forward; now the error is:
`Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner`
Attached is the yaml we used.
x
looks like you've set `standalone` for the execution framework name; I think it should be `spark` (docs)
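For reference, the Spark variant of that section usually looks something like this (a sketch based on the batch ingestion docs; pick the runner classes for the jobs you actually run):
```yaml
# executionFrameworkSpec for Spark instead of standalone.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
```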
e
same error 😕
`Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner`
This is the spark_job.sh:
```bash
export PINOT_VERSION=0.8.0
export PINOT_ROOT_DIR=/opt/pinot
export SPARK_HOME=/root/spark-2.4.8-bin-hadoop2.7
export PINOT_DISTRIBUTION_DIR=/opt/pinot


cd ${PINOT_DISTRIBUTION_DIR}

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.PinotAdministrator \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  LaunchDataIngestionJob \
  -jobSpecFile '/opt/pinot/data/ingestJob.yml'
```
x
cc @User pls help shed some light ^
e
Hello, it seems the error message is right and `SparkSegmentGenerationJobRunner` isn't included in the jar. Doing a grep inside the jar, I only found the `IngestionJobRunner` interface:
```bash
jar tvf ../lib/pinot-all-0.8.0-jar-with-dependencies.jar | grep JobRunner
   318 Tue Aug 24 23:32:56 UTC 2021 org/apache/pinot/spi/ingestion/batch/runner/IngestionJobRunner.class
```
I was thinking of rebuilding version 0.8.0 locally and pushing it into the k8s cluster. Another option could be to redeploy the Pinot Helm chart using version 0.9.3. What do you recommend? Thanks!
f
Hello, is there any update?
m
@User ^^
x
I tried to bring back the main method from `org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand`. It will be available in the 0.10.0 release. It's also in the new master branch.
e
Hello guys, debugging again I realized the error was right: the Spark plugin jar was missing from the classpath. I ran `cp -r plugins-external/pinot-batch-ingestion plugins/` and moved forward 😄 Then I got the following error:
```
Caused by: shaded.com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.10.0
	at shaded.com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
	at shaded.com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
	at shaded.com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:808)
	at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
	at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
```
I'm debugging version 0.10.0 locally with `spark-2.4.0-bin-hadoop2.7` from here: https://archive.apache.org/dist/spark/spark-2.4.0/ Any suggestion is welcome!
This is the spark-submit script:
```bash
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/opt/pinot
export SPARK_HOME=/root/spark-2.4.0-bin-hadoop2.7

cd ${PINOT_DISTRIBUTION_DIR}

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar" \
local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile '/opt/pinot/data/ingestJob.yml'
```
x
hmm, I think Pinot is using Jackson 2.10.0… For that, I feel we need to re-shade the `pinot-batch-ingestion-spark` jar with a compatible Jackson version.
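One way to check which Jackson each jar actually bundles (a hypothetical check; note the shaded jar relocates Jackson under `shaded.com.fasterxml`, as the stack trace shows):
```bash
# List Jackson's ObjectMapper entries in each jar; the path prefix shows
# whether the classes are relocated (shaded/...) or not. Paths assume the
# /opt/pinot layout from the scripts above.
jar tvf lib/pinot-all-0.10.0-jar-with-dependencies.jar | grep 'jackson/databind/ObjectMapper.class'
jar tvf plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.10.0-shaded.jar | grep 'jackson/databind/ObjectMapper.class'
```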
e
Hi Xiang, it seems like the Docker image already has the shaded jar:
```bash
root@pinot-controller:/opt/pinot/data# ls ../plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/
pinot-batch-ingestion-spark-0.10.0-shaded.jar
```
x
yes, I feel Spark got the Jackson classes from the pinot-all jar
e
After rebuilding only the `pinot-batch-ingestion-spark` jar with the same Jackson version as Spark, I was able to move forward. Now I get an error when the job tries to push the segment metadata:
```
java.io.IOException: Failed to find file: metadata.properties in: /tmp/segmentTar-cb3750db-872e-4bbe-9a04-7ce859a18581.tar.gz
	at org.apache.pinot.common.utils.TarGzCompressionUtils.untarOneFile(TarGzCompressionUtils.java:198) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.generateSegmentMetadataFile(SegmentPushUtils.java:344) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUriAndMetadata(SegmentPushUtils.java:238) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner$1.call(SparkSegmentMetadataPushJobRunner.java:124) ~[pinot-batch-ingestion-spark-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner$1.call(SparkSegmentMetadataPushJobRunner.java:112) ~[pinot-batch-ingestion-spark-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
```
I checked and the file doesn't exist:
```
root@pinot-controller:/tmp# ls -lah
total 51M
drwxrwxrwt 1 root root 4.0K Apr  8 17:15 .
drwxr-xr-x 1 root root 4.0K Apr  8 17:03 ..
-rw-r--r-- 1 root root  26M Apr  8 17:15 adv1_OFFLINE_2022-03-01_2022-03-01_0.tar.gz
-rw-r--r-- 1 root root  26M Apr  8 17:15 adv1_OFFLINE_2022-03-01_2022-03-01_1.tar.gz
drwxr-xr-x 4 root root 4.0K Apr  6 18:30 data
drwxr-xr-x 4 root root 4.0K Apr  8 17:15 pinot-49d1110b-8481-4dfa-a058-3a22348445ce
drwxr-xr-x 4 root root 4.0K Apr  8 17:04 pinot-4f8933d2-c89c-487e-8737-ebce2a72bcfc
drwxr-xr-x 4 root root 4.0K Apr  8 17:05 pinot-87b83a51-068c-4cac-b714-5f627bb0ac58
drwxr-xr-x 4 root root 4.0K Apr  8 17:15 pinot-dbcff700-a3d7-4445-bc87-b3179e1b87c8
```
x
Have you tried the SegmentTarPush type? SegmentMetadataPush requires the output directory to be a remotely accessible directory (e.g. S3), so that any executor is able to download it and extract the metadata.
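If you do want metadata push, the job spec needs something along these lines (a sketch with a hypothetical S3 bucket; the key point is that `outputDirURI` is remotely accessible):
```yaml
jobType: SegmentCreationAndMetadataPush
outputDirURI: 's3://my-bucket/pinot-segments/ads13'   # hypothetical remote location
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'                             # hypothetical region
```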
e
Cool, using `SegmentCreationAndTarPush` the ingestion finished OK. I will open another thread about segments. Thanks!
👍 1