Aaron Wishnick

05/07/2021, 8:29 PM
I'm getting a lot of errors when trying to run a SegmentCreation job on some Parquet files that were written out by Trino 354. I'll put the errors in a thread. Any ideas?
java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:/tmp/pinot-7dd1e9e9-b1bd-416c-ab4b-1e66a887d7ca/input/20210507_201917_26196_rv8nu_14b41b59-66e3-4f97-9df0-56c76d859102; isDirectory=false; length=3804; replication=1; blocksize=33554432; modification_time=1620419283015; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
        at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
...
Caused by: java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT32, encodings:[BIT_PACKED, PLAIN_DICTIONARY, RLE], path_in_schema:[date], codec:null, num_values:2, total_uncompressed_size:54, total_compressed_size:72, data_page_offset:4, statistics:Statistics(max:7F 62 34 01, min:7F 62 34 01, null_count:0), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)])

Mayank

05/07/2021, 8:32 PM
Seems like an issue in the Parquet reader (outside Pinot).
Does this look familiar? ^^

Aaron Wishnick

05/07/2021, 8:45 PM
Hmm, that does look like the same error message. But this GitHub issue is about parquetjs writing files that can't be read by other clients.

Mayank

05/07/2021, 8:46 PM
Perhaps it's the same underlying root cause? My point is that I'd guess the issue is completely outside of Pinot.

Aaron Wishnick

05/07/2021, 8:53 PM
Ok
I can also read this parquet file using pandas/pyarrow
I understand this is library code outside Pinot, but Pinot is supposed to be able to read parquet files, right?

Mayank

05/07/2021, 8:54 PM
Yeah, that's true
Pinot definitely should either read it correctly or fail gracefully if it cannot.
I was just giving pointers to debug
Do you have a sample Parquet file that we can debug from an IDE?

Aaron Wishnick

05/07/2021, 9:01 PM
I'm so sorry, due to IP concerns I can't share the file directly but if I can come up with a minimal repro I will
Is parquet the preferred format for getting data into Pinot or is there an easier format I should try?

Mayank

05/07/2021, 9:02 PM
There's no preferred format; all of them just implement a RecordReader interface. If you just want to get around this and move fast, you can try any other format (Avro, JSON, ORC, ...).
But if you can indeed repro using a dummy input, we can help debug.
Alternatively, you can debug with your proprietary file yourself. You should be able to run the same code from within an IDE with minimal effort.

Aaron Wishnick

05/07/2021, 9:22 PM
Ok, from what I can tell, the issue is that Trino defaults to using ZSTD compression and Pinot seems not to like that
If I compress using GZIP it works
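
A minimal sketch of how the codec observation above could be checked and worked around with pyarrow (already mentioned in this thread). The `ASSUMED_READABLE` set and all function names are hypothetical, chosen only for illustration; the set reflects what this thread reports (GZIP worked, ZSTD did not), not a documented list of codecs Pinot supports.

```python
from typing import Iterable, Set

# Assumption: codecs the consuming reader can handle. ZSTD is left out
# because the thread reports that GZIP worked and ZSTD did not.
ASSUMED_READABLE = {"UNCOMPRESSED", "SNAPPY", "GZIP"}


def unreadable_codecs(codecs: Iterable[str],
                      readable: Set[str] = ASSUMED_READABLE) -> Set[str]:
    """Return the codecs (upper-cased) that are not in `readable`."""
    return {c.upper() for c in codecs} - readable


def file_codecs(path: str) -> Set[str]:
    """Collect the compression codec of every column chunk in a Parquet file."""
    # Imported lazily so unreadable_codecs() works without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    return {
        meta.row_group(rg).column(col).compression
        for rg in range(meta.num_row_groups)
        for col in range(meta.num_columns)
    }


def rewrite_as_gzip(src: str, dst: str) -> None:
    """Read `src` fully into memory and write it back GZIP-compressed."""
    import pyarrow.parquet as pq

    pq.write_table(pq.read_table(src), dst, compression="gzip")
```

For a file at a hypothetical path `input.parquet`, `unreadable_codecs(file_codecs("input.parquet"))` would list any codecs outside the assumed set, and `rewrite_as_gzip("input.parquet", "input_gzip.parquet")` would produce a GZIP copy to try with the SegmentCreation job. Changing the compression on the Trino side, as described above, avoids the extra rewrite step.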

Mayank

05/07/2021, 9:23 PM
I see. Could you please file an issue? Perhaps Pinot should auto-detect this.

Aaron Wishnick

05/07/2021, 9:29 PM
Sure thing

Mayank

05/07/2021, 9:29 PM
Thank you

Mayank

05/07/2021, 9:32 PM
That was quick, end-to-end 👏

Xiang Fu

05/07/2021, 9:53 PM
Are you using this?
org.apache.pinot.plugin.inputformat.parquet.ParquetAvroRecordReader
Can you try this instead:
org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader

Aaron Wishnick

05/07/2021, 10:45 PM
I've been using
org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
(I think that's what the docs said to do)

Xiang Fu

05/07/2021, 11:18 PM
I think the default is to use the Avro-based Parquet reader.
You can add a config to switch to the native Parquet reader (ParquetNativeRecordReader).

Aaron Wishnick

05/10/2021, 2:07 PM
Oh interesting, thanks!

Xiang Fu

05/10/2021, 7:03 PM
Please give it a try and let me know. Or if you can share a sample file with random data, that would be super helpful for us to write a test and fix the issue.

Mayank

05/18/2021, 5:06 PM
@Tim Santos ^^