# troubleshooting
Bruce Ritchie:
Has anyone recently used the Spark batch ingestion to ingest Parquet files? If so, what version of Spark did you use? Do you have timestamp columns in your data? I'm having no end of issues getting it to actually ingest the data. Issues I've encountered so far:
• JDK 11 issue with an old Apache commons-lang3 version. Workaround: update the dependency in Pinot.
• Parquet version mismatch between EMR 6.3.0 (Spark 3.1.1) and Pinot master causing methodNotFoundException issues. Workaround: bump parquet and avro to newer versions and shade them in Pinot.
• INT96 timestamp type unsupported in the parquet-avro integration. Workaround attempts include using the native Parquet reader (fails as below) and setting conf.set("parquet.avro.readInt96AsFixed", "true"), which reads the timestamp as bytes but then fails in DataTypeTransformer/PinotDataType when it attempts to parse them as a long (see the sketch after this message).
• Native Parquet reader fails with odd errors: FileNotFoundException: File does not exist: /mnt/yarn/usercache/hadoop/appcache/application_1626274417373_0005/container_1626274417373_0005_01_000004/tmp/pinot-f6020dd1-9bdf-4ac1-b1b8-343bb1af5a50/input/part-29508-674459c7-acf4-42b7-84f4-1752dd3ac7bd.c000.snappy.parquet -- no clue as to the cause of this one.
• Partition columns in the path (/data/TransactionDateYear=2016/TransactionDateMonth=02/someparquetfile.parquet) are not detected. I was hoping the native Parquet reader might be smart enough to detect those, but I think it's failing before then.
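For anyone hitting the same INT96 wall: below is a minimal standalone sketch of the readInt96AsFixed route, plus the byte-level decode needed to turn the resulting 12-byte fixed value into an epoch-millis long (the step that fails inside DataTypeTransformer/PinotDataType). The column name TransactionDate and the standalone reader setup are assumptions for illustration; the flag itself is the real parquet-avro option (parquet 1.12+), and INT96 is physically 8 little-endian bytes of nanos-of-day followed by 4 bytes of Julian day.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.avro.generic.GenericFixed;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class Int96Sketch {
  // Julian day number of the Unix epoch (1970-01-01).
  private static final long JULIAN_EPOCH_DAY = 2_440_588L;

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Surface INT96 timestamps as 12-byte Avro fixed values instead of failing.
    conf.set("parquet.avro.readInt96AsFixed", "true");

    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(HadoopInputFile.fromPath(new Path(args[0]), conf))
        .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // "TransactionDate" is a hypothetical column name for illustration.
        GenericFixed ts = (GenericFixed) record.get("TransactionDate");
        System.out.println(int96ToEpochMillis(ts.bytes()));
      }
    }
  }

  // INT96 layout: 8 bytes nanos-of-day + 4 bytes Julian day, little-endian.
  static long int96ToEpochMillis(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    long julianDay = buf.getInt() & 0xFFFFFFFFL;
    return (julianDay - JULIAN_EPOCH_DAY) * 86_400_000L + nanosOfDay / 1_000_000L;
  }
}
```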
Kulbir Nijjer:
@Bruce Ritchie were you able to get past these issues and ingest successfully using Spark? cc @Xiang Fu
Bruce Ritchie:
Hi @Kulbir Nijjer - no. I've switched back to the Druid PoC for the next few weeks. If I come back to Pinot I may write a DataFrame writer that uses the new APIs available in master (roughly the shape sketched below). I think that's the best general approach to solving Spark-based ingestion into Pinot.
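For the record, here is a hypothetical skeleton of what such a writer could look like against the Spark 3.1 DataSource V2 interfaces. None of these class names exist in Pinot; the per-row buffering and commit steps are placeholders where segment creation and upload to the controller would go.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.SupportsWrite;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.LogicalWriteInfo;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriteBuilder;
import org.apache.spark.sql.connector.write.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical sketch of a Spark 3.1 DataSource V2 sink for Pinot.
public class PinotDataSource implements TableProvider {
  @Override public StructType inferSchema(CaseInsensitiveStringMap options) {
    return new StructType(); // a real connector would derive this from the Pinot table schema
  }
  @Override public Table getTable(StructType schema, Transform[] partitioning,
      Map<String, String> properties) {
    return new PinotTable(schema);
  }

  static class PinotTable implements SupportsWrite {
    private final StructType schema;
    PinotTable(StructType schema) { this.schema = schema; }
    @Override public String name() { return "pinot"; }
    @Override public StructType schema() { return schema; }
    @Override public Set<TableCapability> capabilities() {
      return Collections.singleton(TableCapability.BATCH_WRITE);
    }
    @Override public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
      return new WriteBuilder() {
        @Override public BatchWrite buildForBatch() { return new PinotBatchWrite(); }
      };
    }
  }

  static class PinotBatchWrite implements BatchWrite {
    @Override public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
      return (partitionId, taskId) -> new PinotDataWriter();
    }
    // A real implementation would upload the per-partition segments to the controller here.
    @Override public void commit(WriterCommitMessage[] messages) { }
    @Override public void abort(WriterCommitMessage[] messages) { }
  }

  static class PinotDataWriter implements DataWriter<InternalRow> {
    @Override public void write(InternalRow row) {
      // Buffer rows; a real writer would feed them into Pinot's segment creation.
    }
    @Override public WriterCommitMessage commit() { return new PinotCommitMessage(); }
    @Override public void abort() { }
    @Override public void close() { }
  }

  static class PinotCommitMessage implements WriterCommitMessage { }
}
```

With something like this on the classpath, the call site would be roughly df.write().format("com.example.PinotDataSource").mode("append").save() (names hypothetical).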