Mike Davis

05/20/2021, 9:57 PM
Hello, are transform configs supported when generating OFFLINE segments? I'm trying to add a new column via a date transformation and getting:
Caught exception while gathering stats
org.apache.parquet.io.InvalidRecordException: NEW_FIELD_NAME not found in message schema {
ingestionConfig:
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "NEW_FIELD_NAME",
      "transformFunction": "fromEpochDays(OLD_FIELD_NAME)"
    }
  ]
},
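For reference, here is a minimal Python sketch of the arithmetic this transform is expected to perform, assuming `fromEpochDays` converts a count of days since the Unix epoch into epoch milliseconds. The helper `from_epoch_days` and the sample value 18767 are illustrative only; this is not Pinot's code.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def from_epoch_days(days):
    """Convert days since 1970-01-01 (UTC) into epoch milliseconds.

    Hypothetical stand-in for the transform function named in the config.
    """
    return days * MILLIS_PER_DAY


# Sample input: 18767 days after the epoch (an assumed value for OLD_FIELD_NAME).
millis = from_epoch_days(18767)
print(millis)  # 1621468800000
print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).date())  # 2021-05-20
```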

Neha Pawar

05/20/2021, 10:19 PM
Yes, it is supported.
The exception looks like it's coming from Parquet. Can you share the whole stack trace?

Mike Davis

05/20/2021, 10:22 PM
yeah I thought it might be a parquet issue:
Caught exception while gathering stats
org.apache.parquet.io.InvalidRecordException: NEW_FIELD_NAME not found in message schema {
<...schema omitted...>
}

  at org.apache.parquet.schema.GroupType.getFieldIndex(GroupType.java:175) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordExtractor.extract(ParquetNativeRecordExtractor.java:117) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader.next(ParquetNativeRecordReader.java:106) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader.next(ParquetRecordReader.java:64) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:67) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:172) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:153) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:102) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.tools.admin.command.CreateSegmentCommand.lambda$execute$0(CreateSegmentCommand.java:247) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Exception caught:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Caught exception while generating segment from file: /data/data_019c5bcb-0401-e7fc-0019-bd01cc97e583_906_6_0.snappy.parquet
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_292]
  at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_292]
  at org.apache.pinot.tools.admin.command.CreateSegmentCommand.execute(CreateSegmentCommand.java:274) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
Caused by: java.lang.RuntimeException: Caught exception while generating segment from file: /data/data_019c5bcb-0401-e7fc-0019-bd01cc97e583_906_6_0.snappy.parquet
  at org.apache.pinot.tools.admin.command.CreateSegmentCommand.lambda$execute$0(CreateSegmentCommand.java:265) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-5b7023a4e75d91ea75d4f5f575d440b602bf3df6]
  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_292]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
  at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Specifically, I'm using the ParquetNativeRecordExtractor.
I can optionally switch to using plain Avro (non-parquet) if for some reason Native Parquet is lacking some functionality.
FWIW the original source is a Snowflake table so I'm exporting into Parquet purely for ingestion into Pinot so the format is somewhat arbitrary.
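One possible reading of the trace: the extractor is asked for every column it was told to read, including the derived `NEW_FIELD_NAME`, and the Parquet schema lookup throws because that column only exists after the transform runs. The Python sketch below simulates that failure mode and a lenient alternative. All names here (`FieldNotFoundError`, `strict_extract`, `lenient_extract`) are hypothetical; this is a guess at the behavior, not Pinot's or Parquet's actual code.

```python
class FieldNotFoundError(KeyError):
    """Hypothetical stand-in for Parquet's InvalidRecordException."""


def strict_extract(record, fields):
    """Fail as soon as a requested field is missing from the source record."""
    extracted = {}
    for name in fields:
        if name not in record:
            raise FieldNotFoundError(f"{name} not found in message schema")
        extracted[name] = record[name]
    return extracted


def lenient_extract(record, fields):
    """Return None for missing fields so a later transform can fill them in."""
    return {name: record.get(name) for name in fields}


# The source file only holds the original column (sample value is assumed).
source_record = {"OLD_FIELD_NAME": 18767}
# The requested fields include the derived column from the transform config.
fields_to_read = ["OLD_FIELD_NAME", "NEW_FIELD_NAME"]

try:
    strict_extract(source_record, fields_to_read)
except FieldNotFoundError as err:
    print("strict lookup failed:", err)

print("lenient lookup:", lenient_extract(source_record, fields_to_read))
```

If this reading is right, the transform itself never gets a chance to run, because the failure happens while the record is still being read.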

Neha Pawar

05/20/2021, 10:55 PM
And what's the Pinot schema?

Mike Davis

05/20/2021, 11:41 PM
The new field was part of the Pinot schema as a datetime field:
{
  "name": "NEW_FIELD_NAME",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "1:DAYS"
},
I can dig more into it on my end. Good to know that support is there, but maybe there's an issue with the Parquet reader.
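As a quick sanity check on the field spec above, here is a small Python sketch that parses the two strings and tests whether a value lines up with the declared granularity. It assumes the format is colon-separated as size:unit:type and the granularity as size:unit. The helper names and the unit table are illustrative, not part of Pinot.

```python
# Milliseconds per unit; an assumed subset of the units a spec might use.
UNIT_MILLIS = {
    "MILLISECONDS": 1,
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}


def parse_format(fmt):
    """Split a string such as '1:MILLISECONDS:EPOCH' into (size, unit, kind)."""
    size, unit, kind = fmt.split(":", 2)
    return int(size), unit, kind


def granularity_millis(granularity):
    """Turn a string such as '1:DAYS' into a bucket width in milliseconds."""
    size, unit = granularity.split(":")
    return int(size) * UNIT_MILLIS[unit]


def is_aligned(value, fmt, granularity):
    """Check that an epoch value falls exactly on a granularity boundary."""
    size, unit, kind = parse_format(fmt)
    if kind != "EPOCH":
        raise ValueError("this sketch only handles EPOCH formats")
    value_millis = value * size * UNIT_MILLIS[unit]
    return value_millis % granularity_millis(granularity) == 0


# 1621468800000 is midnight UTC on 2021-05-20, so it sits on a day boundary.
print(is_aligned(1621468800000, "1:MILLISECONDS:EPOCH", "1:DAYS"))
```

A value produced by multiplying whole days by 86,400,000 will always pass this check, which is consistent with pairing a day-based source column with a millisecond epoch format and a one-day granularity.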

Neha Pawar

05/21/2021, 7:01 PM
I will try to reproduce this on my end today.