# getting-started
p
Hi, we are in the process of setting up ETL pipelines. The processed data will be stored on S3 in Parquet format, using a datetime column for partitioning. The idea is to connect a Pinot offline table to S3 and let minions handle SegmentGenerationAndPush tasks. Can Pinot handle Parquet files with snappy/lz4 compression? What about dictionary encoding? I don't see any documentation on how to add a custom config for ParquetRecordReader here: https://docs.pinot.apache.org/basics/data-import/pinot-input-formats Could someone point me to the right place to read more about it? Or any tips in general on how to efficiently store data on S3 in Parquet format? Thanks!
l
<https://dev.startree.ai/docs/startree-release-notes/0.6.0#support-for-gz-file-ingestion>
a
It can definitely do snappy with `ParquetNativeRecordReader` (that's what we do). Not sure about lz4 or dictionary encoding, but it should be easy enough to test locally. (`ParquetRecordReader` might work too; we just haven't tried it.)
x
The parquet reader should handle the compression seamlessly.
Your use case is a typical workflow.
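For reference, the workflow described (offline table + S3 + minion SegmentGenerationAndPush) is usually wired up through the table's task and ingestion configs. A rough sketch follows; the bucket path and region are placeholders, and the exact keys should be checked against the Pinot batch-ingestion docs:

```
"task": {
  "taskTypeConfigsMap": {
    "SegmentGenerationAndPushTask": {
      "schedule": "0 */10 * * * ?"
    }
  }
},
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "APPEND",
    "segmentIngestionFrequency": "DAILY",
    "batchConfigMaps": [
      {
        "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        "inputDirURI": "s3://my-bucket/events/",
        "includeFileNamePattern": "glob:**/*.parquet",
        "inputFormat": "parquet"
      }
    ]
  }
}
```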
a
@Pratik Bhadane