# troubleshooting
f
Hi All 👋, I'm trying to configure a date format like this "_2020-12-31T19:59:21.522-0400_" and created _table-schema.json_ as:
```json
"dateTimeFieldSpecs": [{
    "name": "timestampCustom",
    "dataType": "STRING",
    "format" : "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSZZ",
    "granularity": "1:MILLISECONDS"
  }]
```
The table is created successfully, but the POST command returns:
```json
{
  "code": 500,
  "error": "Caught exception when ingesting file into table: foo_OFFLINE. null"
}
```
I discovered it's related to the date format; could you kindly indicate what it should be? I used this site to generate the custom format. Thanks in advance!
x
looks like you had a typo. maybe try `SSSZ` instead, as noted on the website?
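If you want to sanity-check the pattern outside Pinot, here's a minimal sketch with plain `java.text.SimpleDateFormat` (Pinot's SIMPLE_DATE_FORMAT handling is Joda-Time based, but `SSS` and `Z` mean the same three-digit millis and `-0400`-style offset in both):
```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class PatternCheck {
  public static void main(String[] args) throws ParseException {
    // "SSS" = 3-digit milliseconds, "Z" = RFC-822 offset such as -0400.
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
    long millis = sdf.parse("2020-12-31T19:59:21.522-0400").getTime();
    System.out.println(millis); // prints the epoch millis if the pattern matches
  }
}
```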
m
@User ^^
d
This looks like the date format we are using. This is the config we've set up:
```json
{
  "name": "deletedAt",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "granularity": "1:MILLISECONDS"
}
```
f
Still failing 😕 This is what I have:
• _table-schema.json_
```json
{
  "schemaName": "ads13",
  "dimensionFieldSpecs": [
    {
      "name": "id",
      "dataType": "INT"
    },
    {
      "name": "value",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampCustom",
      "dataType": "STRING",
      "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```
• _table-config.json_
```json
{
  "tableName": "ads13",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "timeColumnName": "timestampCustom",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": 365
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    }
  },
  "metadata": {}
}
```
• _data.csv_
```
id,value,timestampCustom
1,foo,2020-12-31T19:59:21.522-0400
```
And then I run:
```bash
/opt/pinot/bin/pinot-admin.sh AddTable -tableConfigFile table-config.json -schemaFile table-schema.json -exec

curl -X POST -F file=@data.csv -H "Content-Type: multipart/form-data" "http://localhost:9000/ingestFromFile?tableNameWithType=ads13_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22csv%22%7D"
```
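(For reference, the `batchConfigMapStr` value `%7B%22inputFormat%22%3A%22csv%22%7D` is just the URL-encoded form of `{"inputFormat":"csv"}`.)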
m
Any logs around what step failed?
f
UPDATE: I discovered that the issue comes from the time format (`HH:mm:ss`); the colon character is what breaks it. For example, this format works: `1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HHmm` (colons removed) with this _data.csv_:
```
id,value,timestampCustom
1,foo,2020-12-31T1959
```
But this format does not work: `1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm` (colon added after the hour) with this _data.csv_:
```
id,value,timestampCustom
1,foo,2020-12-31T19:59
```
What do you recommend? Thanks in advance!
m
Cc @User if you can help document
e
Hello guys, I'm working with Facundo on the same test. Question: will the behavior be the same if we use an `ingestionJobSpec.yaml` instead of the API? Is it worth trying?
x
I tried to reproduce the issue on my end with the setup above. I see a failure in the controller logs during segment name generation.
```
java.lang.IllegalArgumentException: null
        at shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:108) ~[startree-pinot-all-0.10.0-ST.36-jar-with-dependencies.jar:0.10.0-ST.36-565e66063a82d0b4a61c73bfcddbbb3cd0d436ac]
        at org.apache.pinot.segment.spi.creator.name.SimpleSegmentNameGenerator.generateSegmentName(SimpleSegmentNameGenerator.java:53) ~[startree-pinot-all-0.10.0-ST.36-jar-with-dependencies.jar:0.10.0-ST.36-565e66063a82d0b4a61c73bfcddbbb3cd0d436ac]
```
The `ingestFromFile` endpoint is hard-coded to use the 'simple' segment name generator (it was added mainly for testing purposes), and the simple generator doesn't work with a date-formatted time column. For date-formatted time columns, it's recommended to use the 'normalizedDate' segment name generator type (docs). @User when using an ingestion job, the generator type can be configured to 'normalizedDate' (docs), hopefully overcoming this issue.
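For the ingestion job route, the relevant piece of `ingestionJobSpec.yaml` would look something like this (a sketch following the batch job spec layout in the docs; the prefix and flag values are illustrative):
```yaml
# Fragment of ingestionJobSpec.yaml -- segment name generator section only.
segmentNameGeneratorSpec:
  type: normalizedDate              # handles date-formatted time columns
  configs:
    segment.name.prefix: 'ads13'    # illustrative prefix
    exclude.sequence.id: false      # illustrative; keep or drop sequence ids
```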
e
thanks @User. After running the ingestion using Spark, we're facing the error described in this thread. Pinot is deployed using Helm in k8s; the chart installs `PINOT_VERSION=0.10.0-SNAPSHOT`. Is it safe to download the 0.8.0 jars and try again?
After re-installing 0.8.0 in k8s we're able to move forward; now the error is:
`Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner`
Attached is the yaml we used.
x
looks like you've set `standalone` for the execution framework name; I think it should be `spark` (docs)
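For reference, the Spark variant of that section usually looks something like this (a sketch based on the batch ingestion docs; pick the runner classes for the jobs you actually run):
```yaml
# executionFrameworkSpec for Spark instead of standalone.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
```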
e
same error 😕
`Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner`
This is the spark_job.sh:
```bash
export PINOT_VERSION=0.8.0
export PINOT_ROOT_DIR=/opt/pinot
export SPARK_HOME=/root/spark-2.4.8-bin-hadoop2.7
export PINOT_DISTRIBUTION_DIR=/opt/pinot


cd ${PINOT_DISTRIBUTION_DIR}

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.PinotAdministrator \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  LaunchDataIngestionJob \
  -jobSpecFile '/opt/pinot/data/ingestJob.yml'
```
x
cc @User pls help shed some light ^
e
Hello, it seems the error message is right and `SparkSegmentGenerationJobRunner` isn't included in the jar. Doing a grep inside the jar, I only found the `IngestionJobRunner` interface:
```bash
jar tvf ../lib/pinot-all-0.8.0-jar-with-dependencies.jar | grep JobRunner
   318 Tue Aug 24 23:32:56 UTC 2021 org/apache/pinot/spi/ingestion/batch/runner/IngestionJobRunner.class
```
I was thinking of rebuilding version 0.8.0 locally and pushing it into the k8s cluster. Another option could be to redeploy the Pinot Helm chart using version 0.9.3. What do you recommend? Thanks!
f
Hello, is there any update?
m
@User ^^
x
I tried to bring back the main method from `org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand`. It will be available in the 0.10.0 release. It's also in the new master branch.
e
Hello guys, debugging again I realized the error was right: the Spark plugin jar was missing from the classpath. I ran `cp -r plugins-external/pinot-batch-ingestion plugins/` and moved forward 😄 Then I got the following error:
```
Caused by: shaded.com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.10.0
	at shaded.com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
	at shaded.com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
	at shaded.com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:808)
	at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
	at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
```
I'm debugging version 0.10.0 locally with `spark-2.4.0-bin-hadoop2.7` from here: https://archive.apache.org/dist/spark/spark-2.4.0/ Any suggestion is welcome!
This is the spark-submit script:
```bash
export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/opt/pinot
export SPARK_HOME=/root/spark-2.4.0-bin-hadoop2.7

cd ${PINOT_DISTRIBUTION_DIR}

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar" \
local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile '/opt/pinot/data/ingestJob.yml'
```
x
hmm, I think Pinot is using Jackson 2.10.0… For that, I feel we need to re-shade the `pinot-batch-ingestion-spark` jar with a compatible Jackson version.
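One way to check which Jackson each jar actually bundles (a hypothetical check; note the shaded jar relocates Jackson under `shaded.com.fasterxml`, as the stack trace shows):
```bash
# List Jackson's ObjectMapper entries in each jar; the path prefix shows
# whether the classes are relocated (shaded/...) or not. Paths assume the
# /opt/pinot layout from the scripts above.
jar tvf lib/pinot-all-0.10.0-jar-with-dependencies.jar | grep 'jackson/databind/ObjectMapper.class'
jar tvf plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.10.0-shaded.jar | grep 'jackson/databind/ObjectMapper.class'
```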
e
Hi Xiang, it seems like the Docker image already has the shaded jar:
```bash
root@pinot-controller:/opt/pinot/data# ls ../plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/
pinot-batch-ingestion-spark-0.10.0-shaded.jar
```
x
yes, I feel Spark got the Jackson classes from the pinot-all jar
e
After rebuilding only the `pinot-batch-ingestion-spark` jar with the same Jackson version as Spark, I was able to move forward. Now I get an error when the job tries to push the segment metadata:
```
java.io.IOException: Failed to find file: metadata.properties in: /tmp/segmentTar-cb3750db-872e-4bbe-9a04-7ce859a18581.tar.gz
	at org.apache.pinot.common.utils.TarGzCompressionUtils.untarOneFile(TarGzCompressionUtils.java:198) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.generateSegmentMetadataFile(SegmentPushUtils.java:344) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUriAndMetadata(SegmentPushUtils.java:238) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner$1.call(SparkSegmentMetadataPushJobRunner.java:124) ~[pinot-batch-ingestion-spark-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner$1.call(SparkSegmentMetadataPushJobRunner.java:112) ~[pinot-batch-ingestion-spark-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
```
I checked and the file doesn't exist:
```
root@pinot-controller:/tmp# ls -lah
total 51M
drwxrwxrwt 1 root root 4.0K Apr  8 17:15 .
drwxr-xr-x 1 root root 4.0K Apr  8 17:03 ..
-rw-r--r-- 1 root root  26M Apr  8 17:15 adv1_OFFLINE_2022-03-01_2022-03-01_0.tar.gz
-rw-r--r-- 1 root root  26M Apr  8 17:15 adv1_OFFLINE_2022-03-01_2022-03-01_1.tar.gz
drwxr-xr-x 4 root root 4.0K Apr  6 18:30 data
drwxr-xr-x 4 root root 4.0K Apr  8 17:15 pinot-49d1110b-8481-4dfa-a058-3a22348445ce
drwxr-xr-x 4 root root 4.0K Apr  8 17:04 pinot-4f8933d2-c89c-487e-8737-ebce2a72bcfc
drwxr-xr-x 4 root root 4.0K Apr  8 17:05 pinot-87b83a51-068c-4cac-b714-5f627bb0ac58
drwxr-xr-x 4 root root 4.0K Apr  8 17:15 pinot-dbcff700-a3d7-4445-bc87-b3179e1b87c8
```
x
Have you tried the SegmentTarPush type? SegmentMetadataPush requires the output directory to be a remotely accessible directory (e.g. S3), so that any executor is able to download it and extract the metadata.
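If you do want metadata push, the job spec needs something along these lines (a sketch with a hypothetical S3 bucket; the key point is that `outputDirURI` is remotely accessible):
```yaml
jobType: SegmentCreationAndMetadataPush
outputDirURI: 's3://my-bucket/pinot-segments/ads13'   # hypothetical remote location
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'                             # hypothetical region
```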
e
Cool, using `SegmentCreationAndTarPush` the ingestion finished OK. I will open another thread about segments. Thanks!
👍 1