https://pinot.apache.org/ logo
#pinot-dev
Title
# pinot-dev
k

Ken Krugler

01/27/2021, 9:24 PM
I’ve run into a few bugs in Pinot caused by
PinotFS.listFiles()
implementations not returning the protocol with the path. So you get back
/user/hadoop/blah
, not
hdfs:///user/hadoop/blah
. When those paths get used later, without knowledge of the file system, then you run into problems. Does anyone know why
listFiles()
(and maybe other methods in a PinotFS implementation) don’t include the protocol?
x

Xiang Fu

01/28/2021, 12:37 AM
I feel those are bugs in general, it should give the scheme.
k

Ken Krugler

01/28/2021, 12:38 AM
Wonder what would break with a change like that 🙂
x

Xiang Fu

01/28/2021, 12:39 AM
@Ting Chen are you using this ?
t

Ting Chen

01/28/2021, 12:49 AM
which PinotFS subclass are you using? HadoopFS? we are using a Uber internal Pinot FS for HDFS.
k

Ken Krugler

01/28/2021, 12:50 AM
@Ting Chen HadoopFS
t

Ting Chen

01/28/2021, 12:57 AM
Copy code
if (_hadoopFS.exists(path)) {
  // _hadoopFS.listFiles(path, false) will not return directories as files, thus use listStatus(path) here.
  List<FileStatus> files = listStatus(path, recursive);
  for (FileStatus file : files) {
    filePathStrings.add(file.getPath().toUri().getRawPath());
  }
} else {
I think getRawPath() is the issue
our alternative impl use getPath() which is similar. so far we did not get into much issues. May I know what problems did you encounter?
k

Ken Krugler

01/28/2021, 1:03 AM
E.g. with Hadoop batch segment generation, the input paths (written out to temp files, for processing in the job) are all
/user/hadoop/blah
paths, without the scheme, so the mapper then doesn’t know how to process them (since in theory they could be
file:///user/hadoop/blah
paths)
And in the stand-alone segment generation code, this same issue meant the code called a routine to “fix up” the paths, but that fix up code had a bug (that I recently fixed) where it would drop the namenode part of the path. So you’d get
hdfs:///user/hadoop/blah
, not the required
<hdfs://server.com:9000/user/hadoop/blah>
Haven’t tried this…does the PinotFS for files return
file:///user/blah
, or just
/user/blah
?
t

Ting Chen

01/28/2021, 1:43 AM
Based on the API of PinotFS:
Copy code
/**
 * Lists all the files and directories at the location provided.
 * Lists recursively if {@code recursive} is set to true.
 * Throws IOException if this abstract pathname is not valid, or if an I/O error occurs.
 * @param fileUri location of file
 * @param recursive if we want to list files recursively
 * @return an array of strings that contains file paths
 * @throws IOException on IO failure. See specific implementation
 */
public abstract String[] listFiles(URI fileUri, boolean recursive)
    throws IOException;
It return file "paths" which is technically the path component of the URI w/o scheme/authority and so on
so the HadoopFS impl is conforming to the interface here. For your issue, can you get the scheme from URI?
k

Ken Krugler

01/29/2021, 11:55 PM
@Ting Chen - not sure what you mean by “from URI”. These paths are stored as text in files that are then provided as the input files to the Hadoop job. The Hadoop mapper can’t blindly assume that they’re HDFS. It could use the provided segment generation job spec to get the URI for the input dir and extract the scheme from that, but feels both awkward and fragile. And if you look at the code, you’ll see lots of places where raw paths are getting “fixed up” to add back in the scheme. Maybe a new listPaths() or listUris() method for PinotFS implementations?
t

Ting Chen

01/30/2021, 12:44 AM
oh. by "from URI", I mean the 1st input parameter of the listFile() method. I was talk about deriving scheme from that URI. btw, I do agree we can add a new variation of listFiles() to return a full uri instead of the path component only.