
Ken Krugler

06/02/2021, 9:33 PM
I’m running into an issue when building segments with 0.7.1 that didn’t occur with 0.6.0, due to (I think) using a Unicode code point for my `multiValueDelimiter`. The relevant bit of my job file is:
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: '\ufff0'
With 0.6.0 this works fine. With 0.7.1 I get:
shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `char` out of VALUE_STRING token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig["multiValueDelimiter"])
	at shaded.com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1442) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1216) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1126) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:448) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.deser.std.NumberDeserializers$CharacterDeserializer.deserialize(NumberDeserializers.java:405) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:129) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1719) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at shaded.com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1350) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.apache.pinot.spi.utils.JsonUtils.jsonNodeToObject(JsonUtils.java:117) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:88) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199) ~[pinot-batch-ingestion-standalone-0.7.1-shaded.jar:0.7.1-e22be7c3a39e840321d3658e7505f21768b228d6]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_291]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

Mayank

06/02/2021, 10:48 PM
I am guessing we moved to a newer version of Jackson that is having trouble reading the delimiter into a char?

Ken Krugler

06/02/2021, 11:29 PM
Well, it’s OK if I use `multiValueDelimiter: 'a'`, but it’s not OK if I do something like `multiValueDelimiter: '\u0040'`. Where in the code is the job YAML file converted to a RecordReaderSpec?

Mayank

06/02/2021, 11:41 PM
Check `IngestionJobLauncher.java`, assuming that you are using it.

Ken Krugler

06/02/2021, 11:41 PM
Yes, thanks - working on a unit test to see if I can find the issue :)

Mayank

06/02/2021, 11:42 PM
Cool, thanks
Either there's a code change or a lib change that is not able to handle your delim.

Ken Krugler

06/03/2021, 3:20 PM
Looks like the YAML parser used by Pinot 0.6.0 had a bug where it would treat \ufff0 as a Unicode escape sequence inside single quotes, but according to the latest spec that’s only supposed to happen inside double quotes. So changing my job spec to look like:
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: "\ufff0"
(double quotes for the `multiValueDelimiter` value) fixed the problem.
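
A minimal sketch of why the quoting matters, in plain Python with no YAML library. It assumes nothing about Pinot's or SnakeYAML's internals: `decode_yaml_scalar` and `to_char` are hypothetical helpers that only imitate the two rules discussed here (single quotes keep backslashes literal, double quotes process `\uXXXX`, and a `char` field needs a one-character string).

```python
import re


def decode_yaml_scalar(token: str) -> str:
    # Hypothetical helper, not a real YAML parser: it handles only the
    # quoting rules relevant to this thread.
    quote, body = token[0], token[1:-1]
    if quote == "'":
        # Single-quoted scalar: backslashes are literal characters.
        # The only escape is '' for one single quote.
        return body.replace("''", "'")
    if quote == '"':
        # Double-quoted scalar: backslash escapes are processed.
        def unescape(match):
            esc = match.group(1)
            if esc[0] == "u":
                return chr(int(esc[1:], 16))  # \uXXXX -> one code point
            return esc  # escaped backslash or double quote
        return re.sub(r'\\(u[0-9a-fA-F]{4}|\\|")', unescape, body)
    raise ValueError("expected a quoted scalar")


def to_char(value: str) -> str:
    # Imitates the failure in the stack trace: a char field can only
    # be filled from a string that is exactly one character long.
    if len(value) != 1:
        raise ValueError(
            f"cannot convert {value!r} (length {len(value)}) to a char"
        )
    return value


single = decode_yaml_scalar(r"'\ufff0'")  # six literal characters
double = decode_yaml_scalar(r'"\ufff0"')  # the single code point U+FFF0
print(len(single), len(double))  # prints: 6 1
```

Under these assumptions, `to_char(double)` succeeds while `to_char(single)` raises, which matches `'a'` working and `'\u0040'` failing earlier in the thread.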

Mayank

06/03/2021, 3:21 PM
Thanks @Ken Krugler for finding this.

Ken Krugler

06/03/2021, 3:21 PM
Though in checking the 0.6.0 vs 0.7.1 pom.xml, it seems like both used `snakeyaml` `1.16`…hmmm

Mayank

06/03/2021, 3:22 PM
Maybe a transitive dependency?
Would you mind adding this to the FAQ?

Ken Krugler

06/03/2021, 3:25 PM
Sure

Mayank

06/03/2021, 3:33 PM
Thanks

Ken Krugler

06/11/2021, 6:40 PM
Done

Mayank

06/11/2021, 6:40 PM
Thank you