# dev
s
Sounds like a good improvement request would be to consider all source columns for the parquet reader, including any used in transforms. Would you like to write one up here: https://github.com/apache/druid/issues
m
Just want to confirm whether my understanding is correct and/or if anyone has run into a similar issue with hadoop ingestion of parquet files.
I haven’t used hadoop / the parser in a long time haha
s
That's more than I've used it 😉, but you seem to have uncovered a problem with it, so it would be good to document.
m
Why use hadoop when we have MSQE, right? 😂
s
Exactly! 😄 I wonder what the behavior is with MSQE + Parquet field selection... sounds like an interesting test.
c
i vaguely remember this code, iirc part of the problem is that transforms happen on top of the parsed rows, so the part that handles the flattenSpec doesn’t have access to which columns are needed, only the final set of output dims
native inputFormats sort of have this problem too in that there is no way to know the full set of columns required to be read from the input source, at least from within the format
we might want to pass the transformSpec in .. somewhere
for parquet it is more of a problem than for other formats because it tries to trim the set of columns it reads; the columnar format makes it expensive to read columns you don’t need, so it’s trying to avoid waste
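for context, the way parquet-mr supports projection is that a ReadSupport hands back a trimmed requestedSchema from init(), and the reader then skips the column chunks for everything left out. A generic sketch of that upstream pattern (TrimmingReadSupport and pruneToRequiredColumns are made-up names, not Druid’s actual read support):

```java
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.schema.MessageType;

// Generic sketch of upstream parquet-mr projection, not Druid's actual
// read support: whatever MessageType init() returns becomes the
// requested schema, and column chunks outside it are never read.
public abstract class TrimmingReadSupport<T> extends ReadSupport<T>
{
  @Override
  public ReadContext init(InitContext context)
  {
    final MessageType fullSchema = context.getFileSchema();
    // trimming away a column that a transform still references is
    // exactly the failure mode being discussed here
    return new ReadContext(pruneToRequiredColumns(fullSchema));
  }

  protected abstract MessageType pruneToRequiredColumns(MessageType fullSchema);
}
```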
m
I took a look at the parquet inputFormat and it doesn’t have this problem, because it just reads all the columns of the parquet file. It doesn’t even consider any of Druid’s specs (dimensionSpec, metricSpec, etc.)
I am looking at ParquetReader#intermediateRowIterator
```java
final org.apache.parquet.hadoop.ParquetReader<Group> reader;

reader = closer.register(
    org.apache.parquet.hadoop.ParquetReader.builder(new GroupReadSupport(), path)
        .withConf(conf)
        .build()
);
```
The reader (which reads in the parquet file) is built without any of Druid’s specs (dimensionSpec, metricSpec, etc.)
The conf passed to withConf is hadoop-common’s Configuration class, which has nothing to do with Druid’s specs
I think one way we can fix the Parquet Parser is to just return fullSchema, similar to what we already do when parseSpec instanceof ParquetParseSpec && flattenSpec != null
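Something like this, maybe (a rough sketch, assuming it sits next to that existing check and that the HadoopDruidIndexerConfig is reachable from there as config, which it may not be today):

```java
// Rough sketch: if any transforms are defined we can't tell (yet) which
// input columns their expressions read, so fall back to the full schema.
// Wasteful for wide files, but at least correct.
if (!config.getSchema().getDataSchema().getTransformSpec().getTransforms().isEmpty()) {
  return fullSchema;
}
```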
The HadoopDruidIndexerConfig (which we already get from HadoopDruidIndexerConfig.fromConfiguration(context.getConfiguration())) does have the DataSchema spec, though, so we can get the transformSpec from it. I’m not sure how easy it is to get the source fields from the TransformSpec, however. Took a quick look, and maybe the RowFunction inside the Transform would need to expose a new method. In this new method, we would call analyzeInputs on the expr, getting back a BindingAnalysis. Then we can call getRequiredBindings. Maybe?
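Roughly what I had in mind, as a sketch (not tested; getRequiredColumns is a hypothetical new helper, and it only covers expression-based transforms, not columns referenced by the filter):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.druid.math.expr.Expr;
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.math.expr.Parser;
import org.apache.druid.segment.transform.ExpressionTransform;
import org.apache.druid.segment.transform.Transform;
import org.apache.druid.segment.transform.TransformSpec;

// Hypothetical helper: collect every input column referenced by the
// expressions in a TransformSpec, via Expr#analyzeInputs().
static Set<String> getRequiredColumns(TransformSpec transformSpec, ExprMacroTable macroTable)
{
  final Set<String> required = new HashSet<>();
  for (Transform transform : transformSpec.getTransforms()) {
    if (transform instanceof ExpressionTransform) {
      final Expr parsed = Parser.parse(((ExpressionTransform) transform).getExpression(), macroTable);
      // BindingAnalysis knows every identifier the parsed expression reads
      required.addAll(parsed.analyzeInputs().getRequiredBindings());
    }
  }
  return required;
}
```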
c
> I took a look at the parquet inputFormat and it doesn’t have this problem, because it just reads all the columns of the parquet file.
oh, that’s kind of sad
it does seem useful for the hadoop reader and inputFormat both to be able to know the full set of required columns that are needed
need to look at some stuff to see if there is an obvious way to do this
m
> it does seem useful for the hadoop reader and inputFormat both to be able to know the full set of required columns that are needed
I agree. However, I think the most important thing is to fix hadoop ingestion of parquet files for the case described above. If it is not too much work, we can do the right fix (correctly getting the required columns) straight away.
lmk if you have a chance to look at RowFunction/BindingAnalysis in the Transform
Also, I think the flattenSpec is applied after reading from parquet
so the columns used in the flattenSpec (if not part of the dimensionSpec) must also be added to the set of required columns
oh actually, we currently return fullSchema if there is a flattenSpec.
we can think about whether we want to optimize this too
```java
// this is kind of lame, maybe we can still trim what we read if we
// parse the flatten spec and determine it isn't auto discovering props?
if (parseSpec instanceof ParquetParseSpec) {
  if (((ParquetParseSpec) parseSpec).getFlattenSpec() != null) {
    return fullSchema;
  }
}
```
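If we do optimize it, maybe something along these lines. A sketch assuming the flattenSpec here is a JSONPathSpec; flattenSpecColumns and extractRootColumn are hypothetical helpers, and the root-column extraction is deliberately naive:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.druid.java.util.common.parsers.JSONPathFieldSpec;
import org.apache.druid.java.util.common.parsers.JSONPathSpec;

// Hypothetical helper: the root parquet columns a flattenSpec reads, or
// null to signal that field discovery is on and we must read everything.
static Set<String> flattenSpecColumns(JSONPathSpec flattenSpec)
{
  if (flattenSpec.isUseFieldDiscovery()) {
    return null; // caller should fall back to fullSchema
  }
  final Set<String> columns = new HashSet<>();
  for (JSONPathFieldSpec field : flattenSpec.getFields()) {
    columns.add(extractRootColumn(field.getExpr()));
  }
  return columns;
}

// Deliberately naive: "$.foo.bar" -> "foo". Real JSONPath parsing would
// also need to handle bracket notation, array indexes, filters, etc.
static String extractRootColumn(String jsonPath)
{
  final String stripped = jsonPath.startsWith("$.") ? jsonPath.substring(2) : jsonPath;
  final int dot = stripped.indexOf('.');
  return dot < 0 ? stripped : stripped.substring(0, dot);
}
```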
I put up a fix for this at https://github.com/apache/druid/pull/13612, in case anyone is interested