# general
k
We have about 1500 segments in our HDFS deep store directory. We push these to our Pinot cluster via a metadata push, so only the URI is sent to the controller, which works well. But when we add a single new segment, our push job still has to download and untar all 1500 segments, because we can’t specify a pattern to filter the files in the output directory down to just the new one. We could add per-month subdirectories in HDFS to restrict the number of files being processed this way, but is there a better approach? Note that the files in HDFS can’t be moved around, as their deep store URIs are part of the Zookeeper state.
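To illustrate the problem, here is roughly what the discovery step looks like today. This is a minimal sketch, not Pinot's actual push job code; the class and method names are illustrative:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SegmentLister {
  // Lists every segment tarball in the deep store output directory.
  // With ~1500 segments this returns all of them, even when only one is new.
  static List<URI> listSegmentTarballs(URI outputDirUri) throws Exception {
    FileSystem fs = FileSystem.get(outputDirUri, new Configuration());
    List<URI> segmentUris = new ArrayList<>();
    for (FileStatus status : fs.listStatus(new Path(outputDirUri))) {
      // The only filter today is the implicit ".tar.gz" suffix check.
      if (status.getPath().getName().endsWith(".tar.gz")) {
        segmentUris.add(status.getPath().toUri());
      }
    }
    return segmentUris;
  }
}
```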
m
Perhaps we could support a file name or pattern to select segments?
k
That was one thought I had, yes.
Essentially there’s already an implicit pattern, since the filename has to end with “.tar.gz”.
m
Yeah, should be easy to enhance.
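Something along these lines, as a rough sketch; the helper and the use of `java.nio.file.PathMatcher` glob syntax are just illustrative, not an existing Pinot option:

```java
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FilteredSegmentLister {
  // Same listing as before, but with a caller-supplied glob so the push job
  // only touches the segments that match.
  static List<URI> listMatchingTarballs(URI outputDirUri, String globPattern) throws Exception {
    FileSystem fs = FileSystem.get(outputDirUri, new Configuration());
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher(globPattern);
    List<URI> matched = new ArrayList<>();
    for (FileStatus status : fs.listStatus(new Path(outputDirUri))) {
      String name = status.getPath().getName();
      // Keep the implicit ".tar.gz" check, and additionally apply the glob.
      if (name.endsWith(".tar.gz") && matcher.matches(Paths.get(name))) {
        matched.add(status.getPath().toUri());
      }
    }
    return matched;
  }
}
```

A pattern such as `glob:myTable_2023-06-*.tar.gz` would then restrict the push to just the new segment(s), instead of downloading and untarring all 1500.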
k
Per-month and per-day sub-directories are not a bad idea either.
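For example, with a hypothetical per-month layout (the base path and helper here are made up for illustration), each push job would only list that month's tarballs, while existing segments stay where they are since their deep store URIs are recorded in Zookeeper:

```java
import java.net.URI;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

public class MonthlyOutputDir {
  // Builds a per-month output directory such as
  // hdfs://namenode/pinot/segments/myTable/2023-06/ so each push job
  // only lists that month's tarballs instead of the full directory.
  static URI monthlyDir(URI baseDirUri, YearMonth month) {
    String suffix = month.format(DateTimeFormatter.ofPattern("yyyy-MM")) + "/";
    return URI.create(baseDirUri.toString().endsWith("/")
        ? baseDirUri + suffix
        : baseDirUri + "/" + suffix);
  }
}
```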