# troubleshooting
k
Is anyone else using Pinot’s Hadoop map-reduce support for building segments? Asking because after switching to 0.8 (from 0.7.1) it no longer works (it can’t find the HDFS plugin), plus there are some other odd issues.
m
@Jack ^^
j
LinkedIn uses the one in the
/pinot-plugins/pinot-batch-ingestion/v0_deprecated/pinot-hadoop
directory. The latest version we’re using is commit
de2f0e04dca8130a09ea902787a75997b70cc16d
in the GitHub repo.
k
Thanks @Jack
k
What’s the issue?
m
@Ken Krugler If you could share details about the issue you are seeing, that would be helpful.
k
Hi @Kishore G & @Mayank - thanks. I’ve been waiting for the ops team to resolve what seems like a logging issue (I’m not seeing output that should be there); once that’s fixed I’ll have better details on what I think is going on. But the main issue is that plugins (HDFS, specifically) don’t seem to be on the classpath when the map task starts up, even though the plugins tarball is created and exists in the YARN distributed cache (and contains the expected jars).
m
@Kulbir Nijjer this sounds similar to the Spark issue we saw?
k
@Mayank @Kulbir Nijjer - is this captured by a GitHub issue? I did a quick scan and didn’t see anything that looked similar.
I found the discussion (from 4 days ago) with @Kulbir Nijjer and @Nisheet about failures using Spark, which I assume is what @Mayank meant by “the spark issue”.
m
Yeah, I was referring to the issue you found @Ken Krugler.
n
@Ken Krugler Have you tried explicitly specifying the pinot-hdfs plugin in extraJavaOptions? I hit a similar issue with an HDFS class not being found, and adding it to extraJavaOptions resolved it:
spark.driver.extraJavaOptions=-Dplugins.include=pinot-s3,pinot-parquet,pinot-hdfs
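Roughly how that fits into a full spark-submit command (the paths, version, and job-spec file name here are just placeholders for your setup, and you may also need spark.executor.extraJavaOptions so the executors pick up the plugins):
```
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/path/to/apache-pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet,pinot-hdfs" \
  /path/to/pinot-all-0.8.0-jar-with-dependencies.jar \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
```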
k
@Ken Krugler yes, there are some oddities in how things work in certain Spark versions vs. others with the exact same steps, and we are looking into it. On the Spark side, both using
dependencyJarDir
in the ingestion YAML (you need to manually upload all plugin jars to an HDFS/S3/GCS path) and passing the plugin jars via
--jars
have worked well in my testing. Just using
-Dplugins
in the command should cause the jars to be packaged and added to the YARN distributed cache, but somehow that’s not always happening. In any case, it would be good to first get a stack trace of the exact error to confirm whether this is a known issue or a new one.
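For reference, roughly where dependencyJarDir would sit in the ingestion job spec (paths are placeholders, and I’m assuming it goes under executionFrameworkSpec.extraConfigs alongside stagingDir):
```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///path/to/staging'
    # Assumed placement: directory with all plugin jars, uploaded manually to HDFS/S3/GCS
    dependencyJarDir: 'hdfs:///path/to/pinot-plugin-jars'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input'
outputDirURI: 'hdfs:///path/to/segments'
```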
k
I feel the plugin concept is hard to get right when external systems (such as Hadoop/Spark) launch the process.
We should probably have Maven profiles to build an all-in-one jar for Hadoop and Spark.
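Just a sketch of what that could look like (profile id and module wiring are hypothetical, not something that exists in the repo today):
```xml
<!-- Hypothetical profile: shade the batch-ingestion job plus its plugins into one jar -->
<profile>
  <id>hadoop-uber-jar</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <!-- merge META-INF/services entries so SPI-based plugin lookups still work -->
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```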