# troubleshooting
Hi team, I have the following issues and questions about Pinot Spark batch ingestion:
1. The ingestion job fails when the input parquet data contains a timestamp-typed column; it seems related to the unsupported INT96 timestamp type. Is the workaround here just preprocessing the data to cast it to another format?
2. Can partition columns in the input path be picked up by the ingestion job? E.g., with an input structure like s3://bucket/metrics/dt=2022-04-01/files.parquet, can the dt column be ingested into a Pinot table column directly?
Seems like I am running into most of the issues in this thread: https://apache-pinot.slack.com/archives/C011C9JHN7R/p1626295708055100 I wonder if there have been any updates on these problems in the last 8 months 👀
1. Use native parquet reader
2. @User
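For anyone else hitting this: the native parquet reader is selected in the batch ingestion job spec via the record reader class. A minimal sketch of the relevant job-spec fragment, assuming the standard Pinot batch ingestion YAML layout and that the `pinot-parquet` input-format plugin is on the classpath:

```yaml
# Fragment of a Pinot batch ingestion job spec (sketch; table name and
# paths omitted). The native reader bypasses the Avro conversion path,
# which is where the INT96 timestamp limitation shows up.
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetNativeRecordReader'
```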
I tried the native parquet reader but it failed with the same weird error as in that old thread:
```
Caused by: java.io.FileNotFoundException: File does not exist: /mnt/data/yarn/usercache/grace/appcache/application_1641717365497_8097/container_1641717365497_8097_01_000002/tmp/pinot-56f5f6ca-c206-4896-b51e-f490c8b04893/input/part-00070-f1c41c9a-8f47-4ef2-9d2b-376ad995e591.c000
```
For 2, I don’t think it’s feasible right now. Potential workarounds could be: 1) derive the dt value from a column that is still present in the data file (via a Pinot ingestion transformation); or 2) maybe try this to keep the partitioning columns when writing out the files with Spark.
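Workaround 1) would look roughly like the table-config fragment below, assuming a hypothetical `eventTimeMillis` epoch-millis column exists in the data file from which `dt` can be derived:

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "dt",
        "transformFunction": "toDateTime(eventTimeMillis, 'yyyy-MM-dd')"
      }
    ]
  }
}
```

For workaround 2), one thing to watch: Spark's `df.write.partitionBy("dt")` puts dt into the output directory path but drops it from the parquet files themselves, so duplicating the column before writing (e.g. `df.withColumn("dt_part", col("dt")).write.partitionBy("dt_part")`) keeps a readable dt column inside each file.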
yeah, it’s a bit tricky for us because we don’t have a time column in the data files that carries the same information to derive dt from. I think we will need to further process the upstream data to prepare it for Pinot ingestion. Thanks for looking! 🙏
👌 1