# troubleshooting
k
Is anyone else using Pinot’s Hadoop map-reduce support for building segments? Asking because after switching to 0.8 (from 0.7.1) it no longer works (it can’t find the HDFS plugin), plus there are some other odd issues.
m
@Jack ^^
j
LinkedIn uses the one in the
/pinot-plugins/pinot-batch-ingestion/v0_deprecated/pinot-hadoop
directory. The latest version we’re using is commit
de2f0e04dca8130a09ea902787a75997b70cc16d
in the GitHub repo.
k
Thanks @Jack
k
What’s the issue?
m
@Ken Krugler If you could share details about the issue you are seeing, that would be helpful.
k
Hi @Kishore G & @Mayank - thanks. I’ve been waiting for the ops team to resolve what seems like a logging issue (I’m not seeing output that should be there); once that’s fixed I’ll have better details on what I think is going on. But the main issue is that plugins (HDFS, specifically) don’t seem to be on the classpath when the map task starts up, even though the plugins tarball is created and exists in the YARN distributed cache (and contains the expected jars).
m
@Kulbir Nijjer this sounds similar to the Spark issue we saw?
k
@Mayank @Kulbir Nijjer - is this captured by a GitHub issue? I did a quick scan and didn’t see anything that looked similar.
I found the discussion (from 4 days ago) with @Kulbir Nijjer and @Nisheet about failures using Spark, which I assume is what @Mayank meant by “the spark issue”.
m
Yeah, I was referring to the issue you found @Ken Krugler.
n
@Ken Krugler Have you tried explicitly specifying the pinot-hdfs plugin in extraJavaOptions? I hit a similar issue with an HDFS class not being found, and adding it to extraJavaOptions resolved it:
spark.driver.extraJavaOptions=-Dplugins.include=pinot-s3,pinot-parquet,pinot-hdfs
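Roughly how that fits into a full spark-submit command (the paths, version, and job-spec file name here are just placeholders for your setup, and you may also need spark.executor.extraJavaOptions so the executors pick up the plugins):
```
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/path/to/apache-pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet,pinot-hdfs" \
  /path/to/pinot-all-0.8.0-jar-with-dependencies.jar \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
```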
k
@Ken Krugler yes, there are some oddities in how things work in certain Spark versions vs. others with the exact same steps, and we are looking into it. On the Spark side, both using
dependencyJarDir
in the ingestion YAML (you need to manually upload all plugin jars to an HDFS/S3/GCS path) and passing the plugin jars via
--jars
have worked well in my testing. Just using
-Dplugins
in the command should cause the jars to be packaged and added to the YARN distributed cache, but somehow that’s not always happening. In any case, it would be good to first get a stack trace of the exact error to confirm whether this is a known issue or a new one.
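For reference, roughly where dependencyJarDir would sit in the ingestion job spec (paths are placeholders, and I’m assuming it goes under executionFrameworkSpec.extraConfigs alongside stagingDir):
```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///path/to/staging'
    # Assumed placement: directory with all plugin jars, uploaded manually to HDFS/S3/GCS
    dependencyJarDir: 'hdfs:///path/to/pinot-plugin-jars'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs:///path/to/input'
outputDirURI: 'hdfs:///path/to/segments'
```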
k
I feel the plugin concept is hard to get right when external systems (such as Hadoop/Spark) launch the process.
We should probably have Maven profiles to build an all-in-one jar for Hadoop and Spark.
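Just a sketch of what that could look like (profile id and module wiring are hypothetical, not something that exists in the repo today):
```xml
<!-- Hypothetical profile: shade the batch-ingestion job plus its plugins into one jar -->
<profile>
  <id>hadoop-uber-jar</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <!-- merge META-INF/services entries so SPI-based plugin lookups still work -->
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```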