# troubleshooting
g
Hi team, I have the following issues and questions about Pinot Spark batch ingestion:
1. The ingestion job fails when the input Parquet data contains a timestamp-type column; it seems related to the unsupported INT96 timestamp type. Is the workaround here just preprocessing the data to cast it to another format?
2. Can the partition columns in the input path be understood by the ingestion job? E.g., with an s3://bucket/metrics/dt=2022-04-01/files.parquet input structure, can the dt column be ingested into a Pinot table column directly?
It seems like I am running into most of the issues in this thread: https://apache-pinot.slack.com/archives/C011C9JHN7R/p1626295708055100. I wonder if there have been any updates on these problems in the last 8 months 👀
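On question 1, preprocessing does sidestep the INT96 issue. A minimal sketch, assuming Spark 3.1+ and a hypothetical `event_time` timestamp column (paths are hypothetical too): either cast the timestamp to epoch millis before writing, or configure Spark to write timestamps as INT64 micros instead of the legacy INT96 encoding.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, unix_millis}

// Sketch: rewrite the input so Pinot never sees INT96 timestamps.
// Column and bucket names are hypothetical.
object CastTimestampsForPinot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cast-timestamps-for-pinot")
      // Option A: keep the timestamp type, but have Spark write it as
      // INT64 micros rather than the legacy INT96 encoding.
      .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
      .getOrCreate()

    // Option B: cast to epoch millis (LONG), which Pinot can ingest
    // directly as an epoch-millis time column.
    spark.read.parquet("s3://bucket/metrics-raw/")
      .withColumn("event_time_millis", unix_millis(col("event_time")))
      .drop("event_time")
      .write
      .parquet("s3://bucket/metrics-pinot-ready/")
  }
}
```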
m
1. Use the native Parquet reader (see the job-spec fragment below)
2. @User
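For reference, the native reader is selected in the batch ingestion job spec. A minimal fragment, assuming the standard Pinot job-spec YAML layout:

```yaml
# Fragment of a Pinot batch ingestion job spec: swaps the record reader
# from the default Avro-based Parquet reader to the native one.
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
```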
g
I tried the native Parquet reader, but it failed with the same weird error as in that old thread:
```
Caused by: java.io.FileNotFoundException: File does not exist: /mnt/data/yarn/usercache/grace/appcache/application_1641717365497_8097/container_1641717365497_8097_01_000002/tmp/pinot-56f5f6ca-c206-4896-b51e-f490c8b04893/input/part-00070-f1c41c9a-8f47-4ef2-9d2b-376ad995e591.c000
```
x
For 2, I don't think it's feasible right now. Potential workarounds: 1) derive the dt value from a column that is still in the data file (via a Pinot ingestion transformation); or 2) keep the partitioning columns in the data itself when writing out the files with Spark (see the sketch below).
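A minimal sketch of workaround 2, assuming Spark 3.x and hypothetical paths and column names. Spark's partitionBy drops its columns from the written data files (they survive only in the directory names), so partitioning on a duplicate column keeps a queryable dt copy inside every Parquet file:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch: keep dt inside the Parquet files while still laying out the
// output directories by date. partitionBy() strips its columns from the
// data files, so we partition on a duplicate column ("dt_part") instead.
object KeepPartitionColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("keep-partition-column").getOrCreate()

    spark.read.parquet("s3://bucket/metrics-raw/") // hypothetical input
      .withColumn("dt_part", col("dt"))
      .write
      .partitionBy("dt_part") // output path: .../dt_part=2022-04-01/...
      .parquet("s3://bucket/metrics/")
  }
}
```

With dt present in every file, the ingestion job can map it to a Pinot column directly instead of having to parse directory names.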
g
Yeah, it's a bit tricky for us because we don't have a time column in the data file that carries the same information to derive it from. I think we will need to further process the upstream data to prepare it for Pinot ingestion. Thanks for looking! 🙏
👌 1