# general
m
I am trying to fetch Parquet files from S3 and load them into Pinot, using an offline table. I am running this command with my job spec:
```
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```
I am seeing the following errors, any idea how to solve this issue?
```
Jan 13, 2021 6:34:24 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:238)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:234)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadP

Failed to generate Pinot segment for file - s3://cdca-metrics-prod-us-east-1-eedr/eedr/events/event_date=2021-01-12/event_hour=12/20210112_235508_00031_tgepm_5672f969-021f-4dfd-a0ad-c209aaf7e84d
java.lang.IllegalArgumentException: INT96 not yet implemented.
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:251) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:236) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:222) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:235) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:215) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:209) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-125402b4b3595d61fcc702ba57143d927b00fe7f]
```
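(For readers following along: a standalone batch ingestion job spec for reading Parquet from S3 usually looks roughly like the sketch below. The bucket, table name, region, and controller URI are placeholders, not values from this thread.)
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
# Input/output locations are placeholders
inputDirURI: 's3://my-bucket/eedr/events/'
includeFileNamePattern: 'glob:**/*'   # adjust if your files have an extension
outputDirURI: 's3://my-bucket/pinot-segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'metrics'
  schemaURI: 'http://localhost:9000/tables/metrics/schema'
  tableConfigURI: 'http://localhost:9000/tables/metrics'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```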
x
The issue here is that Pinot uses the parquet-avro library to read the file, and that library doesn't understand the INT96 type.
Is it possible to convert it to int64?
m
You mean in the table schema?
x
Yes. If it's still not working after that, then we may need a fix on our side to bypass INT96.
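(A purely illustrative sketch of that conversion: if the Parquet files are produced by a SQL engine such as Presto/Trino, something like the following rewrites the INT96 timestamp as an epoch-seconds BIGINT before Pinot ever reads it. The table names here are made up; only `ingress_timestamp` and the partition columns come from the thread.)
```sql
-- Illustrative only: materialize the timestamp as an epoch-seconds BIGINT
-- so the parquet-avro reader used by Pinot never encounters INT96.
CREATE TABLE events_bigint_ts
WITH (format = 'PARQUET') AS
SELECT
  CAST(to_unixtime(ingress_timestamp) AS BIGINT) AS ingress_timestamp,
  event_date,
  event_hour
  -- ...plus the remaining columns...
FROM events;
```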
m
I changed it to BIGINT in Parquet and to a Unix timestamp in the input file, and I am using LONG in my Pinot schema, but I am seeing this error. Any idea?
@Xiang Fu
```
Failed to generate Pinot segment for file - s3://cdca-metrics-prod-us-east-1-eedr/eedr/events/event_date=2021-01-12/event_hour=12/20210114_191731_00060_czckc_0bf22f8d-9a13-4f39-aace-75478f92260e
java.lang.IllegalStateException: Invalid segment start/end time: 5031-12-29T23:00:00.000Z/5032-01-01T11:00:00.000Z (in millis: 96627164400000/96627380400000) for time column: ingress_timestamp, must be between: 1971-01-01T00:00:00.000Z/2071-01-01T00:00:00.000Z
```
x
Just putting our 1:1 messages here to close the loop and for future reference. The issue is that the time value in the raw data is in epoch-seconds format, so we need to modify the schema to update the granularity from `1:MINUTES` to `1:SECONDS`:
```
"dateTimeFieldSpecs": [{
  "name": "ts",
  "dataType": "LONG",
  "format": "1:SECONDS:EPOCH",
  "granularity": "1:MINUTES" -> "1:SECONDS"
}],
```
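(Written out as plain JSON after the change, the field spec becomes:)
```json
"dateTimeFieldSpecs": [{
  "name": "ts",
  "dataType": "LONG",
  "format": "1:SECONDS:EPOCH",
  "granularity": "1:SECONDS"
}],
```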
m
@Xiang Fu I am uploading data from S3 to Pinot using the job spec. When files are big, like 300 MB, it does not upload the segment. Is there any size limit for this job?
```
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```
x
Hmm, did you see any exceptions for the job?
As long as there is enough memory, the creation won’t fail
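(If it does turn out to be a memory issue with larger files, one common adjustment is to give the job a larger heap before launching it. This assumes the pinot-admin.sh launcher honors the JAVA_OPTS environment variable, which recent Pinot launcher scripts do; worth verifying in your version.)
```bash
# Illustrative heap sizes; adjust to the host's available memory
export JAVA_OPTS="-Xms4G -Xmx8G"
./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/metrics/ingestionJobSpec.yaml
```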
m
I don't see any errors, as the log level is INFO.
Where do you change log levels?
x
Can you try:
```
bin/pinot-ingestion-job.sh -jobSpecFile ~/my/data/ingestionJobSpec.yaml
```
It uses the log4j conf `conf/pinot-ingestion-job-log4j2.xml`, which gives more details.
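(For future readers: raising verbosity there is just a matter of lowering the logger level. A minimal log4j2 sketch, not the literal contents of Pinot's `conf/pinot-ingestion-job-log4j2.xml`, looks like this:)
```xml
<!-- Sketch only: lower the root logger level to DEBUG to see per-file details -->
<Configuration>
  <Appenders>
    <Console name="console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} %p %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="debug"> <!-- "info" hides per-segment details; "debug" surfaces them -->
      <AppenderRef ref="console"/>
    </Root>
  </Loggers>
</Configuration>
```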
Where do you run this command? On bare metal or in a container?