# general
I have parquet data in S3 under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/*. I want to use these partitions (year, month, day) to filter data by partition value in Pinot. How can I do that?
Currently, you need to push data to Pinot to be able to query it: https://docs.pinot.apache.org/users/tutorials/ingest-parquet-files-from-s3-using-spark
Yes, I am able to push the parquet files, but during the push I want to create partitions based on the S3 prefix (the data is already partitioned in S3 and I want to take advantage of that: s3://my_bucket/logs/year=2018/month=01/day=23/*), e.g. year, month and day here.
In Athena, we can pass PARTITIONED BY during table creation and it handles this; please check scenario 1 in the doc below. I am looking for a way to do the same in Pinot. https://docs.aws.amazon.com/athena/latest/ug/partitions.html
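For reference, this is roughly what scenario 1 in that doc looks like, sketched here with boto3 rather than the console. The table name, the single `message` column, and the results bucket are made up for illustration only:

```python
# Rough sketch of the Athena approach referenced above (scenario 1):
# the partition columns are declared in the DDL, and Athena maps them
# onto the year=/month=/day= prefixes already present in S3.
# Table name, schema, and result bucket are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  message string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my_bucket/logs/'
"""

# Create the table, then load the partitions that already exist under the prefix.
for query in (ddl, "MSCK REPAIR TABLE logs"):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my_bucket/athena-query-results/"},
    )
```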
This is the general path format supported by Spark; it does not store the folder path values (partition values like year, month, day, country) inside the parquet files themselves. https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
This also applies to CSV, JSON, text, etc., not just parquet.
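For illustration, a minimal PySpark sketch of that partition discovery, using the bucket layout from above (the `s3a://` scheme is an assumption; use whatever scheme your Hadoop/EMR setup expects):

```python
# When Spark reads the top-level prefix, the year=/month=/day= directory names
# become regular columns via partition discovery, even though those values are
# not stored inside the parquet files themselves.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-discovery").getOrCreate()

df = spark.read.parquet("s3a://my_bucket/logs/")

# year, month and day show up as (type-inferred) columns derived from the path...
df.printSchema()

# ...so they can be filtered on, and Spark prunes the matching directories.
df.filter((df.year == 2018) & (df.month == 1) & (df.day == 23)).show()
```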
Does anybody have a solution for that?
@User @User For the time column, as long as you are pushing partitioned data into Pinot, it will be used to prune segments for better query performance.
Updating the thread with the offline discussion. Essentially, the ask here is to add derived day/month/year columns based on the date partition. This can be achieved with derived columns during ingestion: https://docs.pinot.apache.org/users/tutorials/schema-evolution#derived-column
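A hedged sketch of what those derived columns could look like, written as config fragments built in Python. The epoch-millis time column name `ts` and the exact transform function names are assumptions, so check the linked docs for your Pinot version:

```python
# Fragments of a Pinot schema and table config that add year/month/day columns
# derived from an epoch-millis time column at ingestion time.
# The column name "ts" and the transform function names are assumptions.
import json

schema_fragment = {
    "dimensionFieldSpecs": [
        # Derived columns must be declared in the schema like normal columns.
        {"name": "year",  "dataType": "INT"},
        {"name": "month", "dataType": "INT"},
        {"name": "day",   "dataType": "INT"},
    ]
}

table_config_fragment = {
    "ingestionConfig": {
        "transformConfigs": [
            # Each derived column is computed from the existing time column
            # during ingestion, so queries can filter on it directly.
            {"columnName": "year",  "transformFunction": "year(ts)"},
            {"columnName": "month", "transformFunction": "month(ts)"},
            {"columnName": "day",   "transformFunction": "dayOfMonth(ts)"},
        ]
    }
}

print(json.dumps(schema_fragment, indent=2))
print(json.dumps(table_config_fragment, indent=2))
```

With those in place, queries can filter on year/month/day directly, while the time column continues to drive segment pruning as noted above.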
@User While this should work, it would be a good enhancement to have in the Spark ingestion job; please file a GitHub issue for it.