# troubleshooting
y
What is the correct schema for a date column? I am using the following:
{
  "name": "sls_d",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
but I am getting this error:
Caused by: java.lang.IllegalArgumentException: Invalid format: "null"
	at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[pinot-all.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
n
can you share the whole dateTimeFieldSpec object?
y
"dateTimeFieldSpecs": [
    {
      "name": "sls_d",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
n
is there more to that stack trace?
y
2020/07/23 09:52:25.681 INFO [DAGScheduler] [dag-scheduler-event-loop] ResultStage 0 (foreach at SparkSegmentGenerationJobRunner.java:214) failed in 91.117 s due to Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 16, brdn2451.target.com, executor 4): java.lang.IllegalArgumentException: Invalid format: "null"
	at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
	at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
	at org.apache.pinot.core.segment.creator.impl.SegmentColumnarIndexCreator.writeMetadata(SegmentColumnarIndexCreator.java:399)
	at org.apache.pinot.core.segment.creator.impl.SegmentColumnarIndexCreator.seal(SegmentColumnarIndexCreator.java:360)
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.handlePostCreation(SegmentIndexCreationDriverImpl.java:216)
	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:199)
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:102)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:278)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:214)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:351)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:351)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
n
does the table config have the timeColumnName mentioned?
y
oh in pinot
"timeColumnName": "sls_d",
"timeType": "DAYS",
n
seems to be breaking at this:
if (config.getTimeColumnType() == SegmentGeneratorConfig.TimeColumnType.SIMPLE_DATE) {
  // For TimeColumnType.SIMPLE_DATE_FORMAT, convert time value into millis since epoch
  DateTimeFormatter dateTimeFormatter = DateTimeFormat.forPattern(config.getSimpleDateFormat());
  startTime = dateTimeFormatter.parseMillis(startTimeStr);
  endTime = dateTimeFormatter.parseMillis(endTimeStr);
  timeUnit = TimeUnit.MILLISECONDS;
}
can you confirm that the raw data has a sls_d column in the correct format? it looks like this piece of code is receiving a null value for the time column
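One way to test that hypothesis is to scan the raw values for anything the date pattern cannot parse. The following is a minimal Python sketch, not Pinot code: `datetime.strptime` with `%Y-%m-%d` stands in for the Joda-Time `parseMillis` call with `yyyy-MM-dd`, and the function name `find_bad_dates` and the sample values are hypothetical.

```python
from datetime import datetime


def find_bad_dates(values, pattern="%Y-%m-%d"):
    """Return (index, value) pairs for values that do not parse with pattern."""
    bad = []
    for i, value in enumerate(values):
        try:
            datetime.strptime(value, pattern)
        except (TypeError, ValueError):
            # TypeError: the value is missing (None).
            # ValueError: the value is text that does not match the pattern,
            # for example the literal string "null".
            bad.append((i, value))
    return bad


# Hypothetical sample: two good rows, one missing value, one literal "null".
sample = ["2019-05-18", None, "2019-05-20", "null"]
print(find_bad_dates(sample))  # [(1, None), (3, 'null')]
```

An empty result means every row parses, which would point away from the data and toward the schema or table config.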
y
sls_d
2019-05-18
2019-05-19
2019-05-20
2019-05-21
2019-05-24
2020-05-21
2020-05-22
n
nothing looks wrong as such.. lemme try to reproduce
k
Is it possible that one of the rows doesn't contain date text in the sls_d column?
n
summarizing the conversation from DM:
1. check if sls_d has any nulls
2. a date in Hive gets stored as days since epoch (INT) in Avro: https://avro.apache.org/docs/1.8.0/spec.html#Date. Try with the sls_d format as 1:DAYS:EPOCH
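For reference, the Avro date logical type annotates an int that counts days from 1970-01-01, so a value that looks like 2019-05-18 in Hive arrives as an integer. A minimal Python sketch of that mapping follows; the helper names `days_to_date` and `date_to_days` are hypothetical and this is only an illustration, not part of Pinot or Avro.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def days_to_date(days):
    """Convert a days-since-epoch integer into a calendar date."""
    return UNIX_EPOCH + timedelta(days=days)


def date_to_days(d):
    """Convert a calendar date into a days-since-epoch integer."""
    return (d - UNIX_EPOCH).days


print(date_to_days(date(2019, 5, 18)))  # 18034
print(days_to_date(18034).isoformat())  # 2019-05-18
```

If the field really is such an integer, a yyyy-MM-dd string pattern has nothing to match, which is why the 1:DAYS:EPOCH format is worth trying.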
y
Yes, actually the data was partitioned on sls_d, so the records themselves did not contain that column. Thanks 🙂