# general
m
I am trying to fetch Parquet files from S3 and load them into Pinot, using an offline table. I am running this command with my job spec:
```
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```
I am seeing the following errors, any idea how to solve this issue?
```
Jan 13, 2021 6:34:24 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:238)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:234)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadP

Failed to generate Pinot segment for file - s3://cdca-metrics-prod-us-east-1-eedr/eedr/events/event_date=2021-01-12/event_hour=12/20210112_235508_00031_tgepm_5672f969-021f-4dfd-a0ad-c209aaf7e84d
java.lang.IllegalArgumentException: INT96 not yet implemented.
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:251) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:236) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:222) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:235) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:215) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:209) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
```
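(For readers following along: a standalone batch ingestion job spec for reading Parquet from S3 usually looks roughly like the sketch below. The bucket, table name, region, and controller URI are placeholders, not values from this thread.)
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
# Input/output locations are placeholders
inputDirURI: 's3://my-bucket/eedr/events/'
includeFileNamePattern: 'glob:**/*'   # adjust if your files have an extension
outputDirURI: 's3://my-bucket/pinot-segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'metrics'
  schemaURI: 'http://localhost:9000/tables/metrics/schema'
  tableConfigURI: 'http://localhost:9000/tables/metrics'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```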
x
The issue here is that Pinot uses the parquet-avro library to read the file, and that library doesn't understand the INT96 type.
Is it possible to convert it to int64?
m
You mean in the table schema?
x
Yes. If it's still not working after that, then we may need a fix on our side to bypass INT96.
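(A purely illustrative sketch of that conversion: if the Parquet files are produced by a SQL engine such as Presto/Trino, something like the following rewrites the INT96 timestamp as an epoch-seconds BIGINT before Pinot ever reads it. The table names here are made up; only `ingress_timestamp` and the partition columns come from the thread.)
```sql
-- Illustrative only: materialize the timestamp as an epoch-seconds BIGINT
-- so the parquet-avro reader used by Pinot never encounters INT96.
CREATE TABLE events_bigint_ts
WITH (format = 'PARQUET') AS
SELECT
  CAST(to_unixtime(ingress_timestamp) AS BIGINT) AS ingress_timestamp,
  event_date,
  event_hour
  -- ...plus the remaining columns...
FROM events;
```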
m
I changed it to BIGINT in Parquet and to a Unix timestamp in the input file, and I am using LONG in my Pinot schema, but I am seeing this error. Any idea?
@Xiang Fu
```
Failed to generate Pinot segment for file - s3://cdca-metrics-prod-us-east-1-eedr/eedr/events/event_date=2021-01-12/event_hour=12/20210114_191731_00060_czckc_0bf22f8d-9a13-4f39-aace-75478f92260e
java.lang.IllegalStateException: Invalid segment start/end time: 5031-12-29T23:00:00.000Z/5032-01-01T11:00:00.000Z (in millis: 96627164400000/96627380400000) for time column: ingress_timestamp, must be between: 1971-01-01T00:00:00.000Z/2071-01-01T00:00:00.000Z
```
x
Just putting our 1:1 messages here to close the loop and for future reference. The issue is that the time value in the raw data is in epoch-seconds format, so we need to modify the schema to update the granularity from `1:MINUTES` to `1:SECONDS`:
```
"dateTimeFieldSpecs": [{
  "name": "ts",
  "dataType": "LONG",
  "format": "1:SECONDS:EPOCH",
  "granularity": "1:MINUTES" -> "1:SECONDS"
}],
```
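(Written out as plain JSON after the change, the field spec becomes:)
```json
"dateTimeFieldSpecs": [{
  "name": "ts",
  "dataType": "LONG",
  "format": "1:SECONDS:EPOCH",
  "granularity": "1:SECONDS"
}],
```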
m
@Xiang Fu I am uploading data from S3 to Pinot using the job spec. When files are big, like 300 MB, it does not upload the segment. Is there any size limit for this job?
```
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```
x
Hmm, did you see any exceptions for the job?
As long as there is enough memory, the creation won’t fail
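(If it does turn out to be a memory issue with larger files, one common adjustment is to give the job a larger heap before launching it. This assumes the pinot-admin.sh launcher honors the JAVA_OPTS environment variable, which recent Pinot launcher scripts do; worth verifying in your version.)
```bash
# Illustrative heap sizes; adjust to the host's available memory
export JAVA_OPTS="-Xms4G -Xmx8G"
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```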
m
I don't see any errors, as the log level is INFO.
Where do you change log levels?
x
Can you try:
```
bin/pinot-ingestion-job.sh -jobSpecFile ~/my/data/ingestionJobSpec.yaml
```
It uses the log4j conf `conf/pinot-ingestion-job-log4j2.xml`, which gives more details.
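(For future readers: raising verbosity there is just a matter of lowering the logger level. A minimal log4j2 sketch, not the literal contents of Pinot's `conf/pinot-ingestion-job-log4j2.xml`, looks like this:)
```xml
<!-- Sketch only: lower the root logger level to DEBUG to see per-file details -->
<Configuration>
  <Appenders>
    <Console name="console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} %p %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="debug"> <!-- "info" hides per-segment details; "debug" surfaces them -->
      <AppenderRef ref="console"/>
    </Root>
  </Loggers>
</Configuration>
```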
Where do you run this command? On bare metal or in a container?