Jonathan Meyer

05/26/2021, 8:40 AM
Hello 🙂 When ingesting batch data that is partitioned by a key (Parquet), that key is "missing" from the Parquet file parts (which makes sense). However, from what I've seen, Pinot then cannot find that key and fails to generate the segments. My current workaround is to duplicate the partition column. Is this a known issue, or is it possible to adjust settings?

Xiang Fu

05/26/2021, 9:56 AM
Do you have a stack trace for the job? The key should be a column in your data, even on the batch side.

Jonathan Meyer

05/26/2021, 1:03 PM
The schema contains several columns, including `dateString`, which the data is partitioned on. This creates Parquet partitions without this key.
Actually, now that I look at it again, I'm seeing
file:/kpi-data/raw/date=2020-11-30/ab331a05255849bf811a173a380aaf1d.parquet
not `dateString=XXX`.
Curious, but I'll check that.

Xiang Fu

05/26/2021, 6:24 PM
oic, the default null string caused the parsing failure
➕ 1
this date has to be a column in your Parquet file
if you generated this Parquet from Spark, you can add the partition key as a column as well

Jonathan Meyer

05/26/2021, 6:26 PM
If the Parquet partitioning key was the one expected by Pinot (`dateString`), it would have worked, right? (Pulling `dateString` values from the file paths)
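
For reference, a minimal sketch of how a partition key and its value can be read back from a Hive-style path such as the one quoted earlier in this thread. The helper name `partition_values` is hypothetical and this is plain Python, not Pinot's or Spark's code; it only illustrates that the path above encodes `date`, not `dateString`.

```python
from urllib.parse import unquote, urlparse


def partition_values(path):
    """Return the key=value pairs found in the directory part of a path.

    Hypothetical helper for illustration only; it is not part of Pinot.
    """
    # Drop the scheme (e.g. "file:") and the trailing file name.
    segments = urlparse(path).path.split("/")[:-1]
    values = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            values[key] = unquote(value)
    return values


path = "file:/kpi-data/raw/date=2020-11-30/ab331a05255849bf811a173a380aaf1d.parquet"
found = partition_values(path)
print(found)
print("dateString" in found)
```

With the path from this thread, the only key recovered is `date`, so a schema column named `dateString` has nothing to be filled from unless the column is also written into the files or the partition directory uses the same name.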

Xiang Fu

05/26/2021, 6:28 PM
yes
the error says the job tried to generate the partition key but got a null value
so it failed

Jonathan Meyer

05/26/2021, 6:29 PM
Makes sense, thanks @Xiang Fu :)
On an unrelated note, I've opened an issue on the Python pinot-db driver, let me know what you think when you've got the time ;)

Xiang Fu

05/26/2021, 6:35 PM
sounds good!