Slackbot
12/15/2022, 9:24 AM
Sergio Ferragut
12/15/2022, 6:53 PM
transforms. Would you like to write one up here: https://github.com/apache/druid/issues
Maytas Monsereenusorn
12/15/2022, 8:40 PM
Maytas Monsereenusorn
12/15/2022, 8:41 PM
Sergio Ferragut
12/15/2022, 8:43 PM
Maytas Monsereenusorn
12/15/2022, 8:46 PM
Sergio Ferragut
12/15/2022, 9:01 PM
Clint Wylie
12/15/2022, 10:21 PM
Clint Wylie
12/15/2022, 10:22 PM
Clint Wylie
12/15/2022, 10:22 PM
Clint Wylie
12/15/2022, 10:25 PM
Maytas Monsereenusorn
12/16/2022, 9:04 AM
Maytas Monsereenusorn
12/16/2022, 9:06 AM
final org.apache.parquet.hadoop.ParquetReader<Group> reader;
reader = closer.register(
    org.apache.parquet.hadoop.ParquetReader.builder(new GroupReadSupport(), path)
        .withConf(conf)
        .build()
);
Maytas Monsereenusorn
12/16/2022, 9:06 AM
Maytas Monsereenusorn
12/16/2022, 9:07 AM
Maytas Monsereenusorn
12/16/2022, 9:08 AM
Maytas Monsereenusorn
12/16/2022, 9:08 AM
Maytas Monsereenusorn
12/16/2022, 9:39 AM
HadoopDruidIndexerConfig (which we've already gotten from HadoopDruidIndexerConfig.fromConfiguration(context.getConfiguration())) does have the DataSchema spec, so we can get the transformSpec. Although I'm not sure how easy it is to get the source field from the TransformSpec. Took a quick look, and maybe the RowFunction inside the Transform would need to expose a new method. In this new method, we would call analyzeInputs on the expr, getting back a BindingAnalysis. Then we can call getRequiredBindings. Maybe?
Clint Wylie
12/16/2022, 10:11 PM
"I took a look at the parquet inputFormat and it doesn’t have this problem because it just reads all the columns of the parquet file."
oh, that’s kind of sad
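The "new method" idea from the 9:39 AM message can be sketched roughly as below. This is only a toy model under stated assumptions: SketchExpr, Identifier, Literal, FunctionCall and SketchTransform are hypothetical stand-ins written for illustration, not Druid's Expr, Transform or RowFunction classes, and collectBindings only imitates what the analyzeInputs / getRequiredBindings pair mentioned in the thread would provide.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an expression tree node (NOT Druid's Expr).
interface SketchExpr {
    // Adds the name of every input column this expression reads to `out`.
    void collectBindings(Set<String> out);
}

// A reference to an input column, e.g. `first`.
final class Identifier implements SketchExpr {
    private final String name;
    Identifier(String name) { this.name = name; }
    @Override public void collectBindings(Set<String> out) { out.add(name); }
}

// A constant; it reads no columns.
final class Literal implements SketchExpr {
    private final Object value;
    Literal(Object value) { this.value = value; }
    Object getValue() { return value; }
    @Override public void collectBindings(Set<String> out) { /* nothing to add */ }
}

// A function call; its inputs are the union of its arguments' inputs.
final class FunctionCall implements SketchExpr {
    private final String function;
    private final List<SketchExpr> args;
    FunctionCall(String function, List<SketchExpr> args) {
        this.function = function;
        this.args = args;
    }
    String getFunction() { return function; }
    @Override public void collectBindings(Set<String> out) {
        for (SketchExpr arg : args) {
            arg.collectBindings(out);
        }
    }
}

// Hypothetical transform exposing the extra method discussed in the thread.
final class SketchTransform {
    private final String name;
    private final SketchExpr expr;
    SketchTransform(String name, SketchExpr expr) {
        this.name = name;
        this.expr = expr;
    }
    String getName() { return name; }
    // Source fields the transform needs from the input row.
    Set<String> getRequiredColumns() {
        Set<String> out = new LinkedHashSet<>();
        expr.collectBindings(out);
        return out;
    }
}

final class TransformSketchDemo {
    public static void main(String[] args) {
        // Models something like concat(first, '-', last) stored as "fullName".
        SketchExpr expr = new FunctionCall(
            "concat",
            Arrays.asList(new Identifier("first"), new Literal("-"), new Identifier("last")));
        SketchTransform transform = new SketchTransform("fullName", expr);
        System.out.println(transform.getRequiredColumns()); // prints [first, last]
    }
}
```

The point of the sketch is that the transform's output name (fullName) is not what the reader has to fetch; the columns to fetch are the identifiers found inside the expression.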
Clint Wylie
12/16/2022, 10:12 PM
Clint Wylie
12/16/2022, 10:13 PM
Maytas Monsereenusorn
12/16/2022, 11:44 PM
"it does seem useful for the hadoop reader and inputFormat both to be able to know the full set of required columns that are needed"
I agree. However, I think the most important thing is to fix hadoop ingestion of parquet files for the case described above. If it is not too much work, we can do the right fix (correctly getting required columns) straight away.
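As a rough illustration of what "the full set of required columns" could mean here, the sketch below unions the timestamp column, the dimensions, the metric input fields and the source fields of each transform. RequiredColumnsSketch and its parameters are hypothetical names made up for this example; how Druid actually assembles these pieces is not shown in the thread.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Druid.
final class RequiredColumnsSketch {
    // transformInputs maps a transform's output name to the input fields its expression reads.
    static Set<String> compute(
        String timestampColumn,
        Collection<String> dimensions,
        Collection<String> metricFieldNames,
        Map<String, Set<String>> transformInputs
    ) {
        Set<String> required = new LinkedHashSet<>();
        required.add(timestampColumn);
        required.addAll(dimensions);
        required.addAll(metricFieldNames);
        // Without this step, a column that is only referenced inside a
        // transform expression would never be requested from the file.
        for (Set<String> inputs : transformInputs.values()) {
            required.addAll(inputs);
        }
        return required;
    }
}
```

A dimension that is itself the output of a transform still ends up in the set; that is harmless as long as names that do not exist in the file are ignored when the read schema is built.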
Maytas Monsereenusorn
12/16/2022, 11:46 PM
Maytas Monsereenusorn
12/16/2022, 11:46 PM
Maytas Monsereenusorn
12/16/2022, 11:46 PM
Maytas Monsereenusorn
12/16/2022, 11:48 PM
Maytas Monsereenusorn
12/16/2022, 11:48 PM
Maytas Monsereenusorn
12/16/2022, 11:49 PM
// this is kind of lame, maybe we can still trim what we read if we
// parse the flatten spec and determine it isn't auto discovering props?
if (parseSpec instanceof ParquetParseSpec) {
  if (((ParquetParseSpec) parseSpec).getFlattenSpec() != null) {
    return fullSchema;
  }
}
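The fallback in the snippet above, together with the trimming it skips, could be modelled along these lines. This is a simplified sketch that treats a schema as a plain list of top-level field names; PartialSchemaSketch, select and the hasFlattenSpec flag are hypothetical and only stand in for the Parquet schema types and the parseSpec check in the quoted code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Druid or Parquet.
final class PartialSchemaSketch {
    static List<String> select(List<String> fullSchema, Set<String> requiredColumns, boolean hasFlattenSpec) {
        // With a flattenSpec, nested fields may be referenced or auto-discovered,
        // so (as in the quoted snippet) give up on trimming and read everything.
        if (hasFlattenSpec) {
            return fullSchema;
        }
        // Otherwise keep only the fields that are actually needed, in file order.
        List<String> partial = new ArrayList<>();
        for (String field : fullSchema) {
            if (requiredColumns.contains(field)) {
                partial.add(field);
            }
        }
        return partial;
    }
}
```

If requiredColumns leaves out the source fields of a transform, the trimmed schema silently drops them, which would match the kind of failure this thread appears to be about.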
Maytas Monsereenusorn
01/03/2023, 5:07 AM