
Ken Krugler

12/15/2020, 1:32 AM
Hey all, I’m now running a segment generation/push that’s using HDFS for input/output. The relevant bits in the job file for input/output dir are:
inputDirURI: 'hdfs://<clustername>/user/hadoop/pinot-input/'
includeFileNamePattern: 'glob:**/us_*.gz'
outputDirURI: 'hdfs://<clustername>/user/hadoop/pinot-segments/'
When I run the job, segments are generated, but then each segment fails with something like:
Failed to generate Pinot segment for file - hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz
java.lang.IllegalStateException: Unable to extract out the relative path based on base input path: hdfs://<clustername>/user/hadoop/pinot-input/
So it looks like the input file URI is getting the authority (<clustername>) stripped out, which is why the baseInputDir.relativize(inputFile) call fails to generate appropriate results in SegmentGenerationUtils.getRelativeOutputPath. Or is there something else I need to be doing here to get this to work properly? I’m able to read the files, so the inputDirURI is set up properly (along with HDFS jars).
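
A minimal sketch of that mismatch, using only java.net.URI rather than any Pinot code. The cluster name mycluster and the class/method names are placeholders for illustration. URI.relativize returns the given URI unchanged when the scheme or authority differs from the base, which would match the failure above:

```java
import java.net.URI;

public class RelativizeDemo {
    // Relativize a file URI against a base directory URI.
    // java.net.URI.relativize returns the file URI unchanged if the
    // scheme or authority does not match the base.
    static String relativize(String baseDir, String file) {
        return URI.create(baseDir).relativize(URI.create(file)).toString();
    }

    public static void main(String[] args) {
        // "mycluster" is a hypothetical stand-in for <clustername>.
        String base = "hdfs://mycluster/user/hadoop/pinot-input/";

        // Authority present on both sides: a relative path comes back.
        System.out.println(relativize(base,
            "hdfs://mycluster/user/hadoop/pinot-input/us_2020-03_03.gz"));
        // prints: us_2020-03_03.gz

        // Authority missing on the file URI: nothing is relativized.
        System.out.println(relativize(base,
            "hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz"));
        // prints: hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz
    }
}
```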

Xiang Fu

12/15/2020, 1:36 AM
I think this is a bug. Can you give an example of the dir URI and the file URI?
Is the authority stripped out when the fs listing happens?

Ken Krugler

12/15/2020, 1:38 AM
I’m guessing the authority is being stripped somewhere: the inputDirURI is correct, the files are being read, and it’s only when trying to create a relativized path for writing the files that the input file URI no longer contains the authority bit.
Dir URI is 'hdfs://<clustername>/user/hadoop/pinot-input/' (in job yml file). But input file URI is hdfs:/user/hadoop/pinot-input/us_2020-03_03.gz (no clustername).

Xiang Fu

12/15/2020, 1:42 AM
I see.
Does it work if you remove the cluster name from input dir uri?

Ken Krugler

12/15/2020, 4:02 AM
Likely yes, but I’m having trouble getting Pinot to pick up all of the Hadoop config settings, so I set the cluster name directly in the URI.
OK, I think the issue is getFileURI() at line 247 in SegmentGenerationJobRunner. This method needs to also get the authority from the base input directory URI, as otherwise it gets just the path (the /user/hadoop/pinot-input/somefile.gz bit), sees that the resulting URI doesn’t have a protocol, and uses the provided protocol to construct the URI, but by then the authority has been lost.
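
A rough sketch of the difference, assuming the behaviour is as described above. This is not the actual Pinot source: the class FileUriSketch, the methods schemeOnly and withAuthority, and the cluster name mycluster are all hypothetical. The first method only applies a fallback scheme, the second also carries over the authority from the base directory URI:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FileUriSketch {
    // Reconstruction of the lossy behaviour: a scheme-less path only
    // receives the base scheme, so the authority is dropped.
    static URI schemeOnly(String pathOrUri, URI baseDir) {
        URI uri = URI.create(pathOrUri);
        if (uri.getScheme() != null) {
            return uri; // already a full URI, leave it alone
        }
        try {
            return new URI(baseDir.getScheme(), uri.getSchemeSpecificPart(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Possible fix: also copy the authority from the base directory URI.
    static URI withAuthority(String pathOrUri, URI baseDir) {
        URI uri = URI.create(pathOrUri);
        if (uri.getScheme() != null) {
            return uri;
        }
        try {
            return new URI(baseDir.getScheme(), baseDir.getAuthority(),
                           uri.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "mycluster" is a hypothetical stand-in for <clustername>.
        URI base = URI.create("hdfs://mycluster/user/hadoop/pinot-input/");
        String path = "/user/hadoop/pinot-input/us_2020-03_03.gz";

        System.out.println(schemeOnly(path, base));    // authority lost
        System.out.println(withAuthority(path, base)); // authority kept
        System.out.println(base.relativize(withAuthority(path, base)));
    }
}
```

With the authority preserved, the last call relativizes cleanly against the base input directory, which is what the relative output path computation needs.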