# general
I have parquet data in S3 under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/*. I want to use these partitions (year, month, day) to filter data by partition value in Pinot. How can I do that?
Currently, you need to push data to Pinot to be able to query it: https://docs.pinot.apache.org/users/tutorials/ingest-parquet-files-from-s3-using-spark
Yes, I am able to push the parquet files, but during the push I want to create partitions based on the S3 prefix (the data is already partitioned in S3 and I want to take advantage of that: s3://my_bucket/logs/year=2018/month=01/day=23/*), e.g. year, month and day here.
In Athena, we can pass PARTITIONED BY during table creation and it handles this; please check scenario 1 in the doc below. I am looking for a way to do the same in Pinot. https://docs.aws.amazon.com/athena/latest/ug/partitions.html
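For reference, this is roughly what scenario 1 in that doc looks like, sketched here with boto3 rather than the console. The table name, the single `message` column, and the results bucket are made up for illustration only:

```python
# Rough sketch of the Athena approach referenced above (scenario 1):
# the partition columns are declared in the DDL, and Athena maps them
# onto the year=/month=/day= prefixes already present in S3.
# Table name, schema, and result bucket are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  message string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my_bucket/logs/'
"""

# Create the table, then load the partitions that already exist under the prefix.
for query in (ddl, "MSCK REPAIR TABLE logs"):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my_bucket/athena-query-results/"},
    )
```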
This is the general path format supported by Spark; it does not store the folder path values (partition values like year, month, day, country) inside the parquet files themselves. https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
This also applies to CSV, JSON, text, etc., not just parquet.
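For illustration, a minimal PySpark sketch of that partition discovery, using the bucket layout from above (the `s3a://` scheme is an assumption; use whatever scheme your Hadoop/EMR setup expects):

```python
# When Spark reads the top-level prefix, the year=/month=/day= directory names
# become regular columns via partition discovery, even though those values are
# not stored inside the parquet files themselves.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-discovery").getOrCreate()

df = spark.read.parquet("s3a://my_bucket/logs/")

# year, month and day show up as (type-inferred) columns derived from the path...
df.printSchema()

# ...so they can be filtered on, and Spark prunes the matching directories.
df.filter((df.year == 2018) & (df.month == 1) & (df.day == 23)).show()
```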
Does anybody have a solution for that?
@User @User For the time column, as long as you are pushing partitioned data into Pinot, it will be used to prune segments for better query performance.
Updating the thread with the offline discussion. Essentially, the ask here is to add derived day/month/year columns based on the date partition. This can be achieved with derived columns during ingestion: https://docs.pinot.apache.org/users/tutorials/schema-evolution#derived-column
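A hedged sketch of what those derived columns could look like, written as config fragments built in Python. The epoch-millis time column name `ts` and the exact transform function names are assumptions, so check the linked docs for your Pinot version:

```python
# Fragments of a Pinot schema and table config that add year/month/day columns
# derived from an epoch-millis time column at ingestion time.
# The column name "ts" and the transform function names are assumptions.
import json

schema_fragment = {
    "dimensionFieldSpecs": [
        # Derived columns must be declared in the schema like normal columns.
        {"name": "year",  "dataType": "INT"},
        {"name": "month", "dataType": "INT"},
        {"name": "day",   "dataType": "INT"},
    ]
}

table_config_fragment = {
    "ingestionConfig": {
        "transformConfigs": [
            # Each derived column is computed from the existing time column
            # during ingestion, so queries can filter on it directly.
            {"columnName": "year",  "transformFunction": "year(ts)"},
            {"columnName": "month", "transformFunction": "month(ts)"},
            {"columnName": "day",   "transformFunction": "dayOfMonth(ts)"},
        ]
    }
}

print(json.dumps(schema_fragment, indent=2))
print(json.dumps(table_config_fragment, indent=2))
```

With those in place, queries can filter on year/month/day directly, while the time column continues to drive segment pruning as noted above.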
@User While this should work, it would be a good enhancement to have in the Spark ingestion job; please file a GitHub issue for it.