# dev
s
Sounds like a good improvement request would be to consider all source columns for the parquet reader, including any used in transforms. Would you like to write one up here: https://github.com/apache/druid/issues
m
Just want to confirm whether my understanding is correct and/or if anyone has run into a similar issue with hadoop ingestion of parquet files.
I haven’t used hadoop / the parser in a long time haha
s
That's more than I've used it 😉, but you seem to have uncovered a problem with it, so it would be good to document.
m
Why use hadoop when we have MSQE, right? 😂
s
Exactly! 😄 I wonder what the behavior is with MSQE + Parquet field selection... sounds like an interesting test.
c
i vaguely remember this code, iirc part of the problem is that transforms happen on top of the parsed rows, so the part that handles the flattenSpec doesn’t have access to which columns are needed, only the final set of output dims
native inputFormats sort of have this problem too in that there is no way to know the full set of columns required to be read from the input source, at least from within the format
we might want to pass the transformSpec in .. somewhere
for parquet it is more of a problem than for other formats because it tries to trim the set of columns it reads; the columnar format makes it expensive to read columns you don’t need, so it’s trying to avoid waste
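for context, the way parquet-mr supports projection is that a ReadSupport hands back a trimmed requestedSchema from init(), and the reader then skips the column chunks for everything left out. A generic sketch of that upstream pattern (TrimmingReadSupport and pruneToRequiredColumns are made-up names, not Druid’s actual read support):

```java
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.schema.MessageType;

// Generic sketch of upstream parquet-mr projection, not Druid's actual
// read support: whatever MessageType init() returns becomes the
// requested schema, and column chunks outside it are never read.
public abstract class TrimmingReadSupport<T> extends ReadSupport<T>
{
  @Override
  public ReadContext init(InitContext context)
  {
    final MessageType fullSchema = context.getFileSchema();
    // trimming away a column that a transform still references is
    // exactly the failure mode being discussed here
    return new ReadContext(pruneToRequiredColumns(fullSchema));
  }

  protected abstract MessageType pruneToRequiredColumns(MessageType fullSchema);
}
```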
m
I took a look at the parquet inputFormat and it doesn’t have this problem, because it just reads all the columns of the parquet file. It doesn’t even consider any of Druid’s specs (dimensionSpec, metricSpec, etc.)
I am looking at ParquetReader#intermediateRowIterator
```java
final org.apache.parquet.hadoop.ParquetReader<Group> reader;

reader = closer.register(
    org.apache.parquet.hadoop.ParquetReader.builder(new GroupReadSupport(), path)
        .withConf(conf)
        .build()
);
```
The reader (which reads in the parquet file) is built without any of Druid’s specs (dimensionSpec, metricSpec, etc.)
The conf passed to withConf is hadoop-common’s Configuration class, which has nothing to do with Druid’s specs
I think one way we can fix the Parquet Parser is to just return fullSchema, similar to what we already do when parseSpec instanceof ParquetParseSpec && flattenSpec != null
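Something like this, maybe (a rough sketch, assuming it sits next to that existing check and that the HadoopDruidIndexerConfig is reachable from there as config, which it may not be today):

```java
// Rough sketch: if any transforms are defined we can't tell (yet) which
// input columns their expressions read, so fall back to the full schema.
// Wasteful for wide files, but at least correct.
if (!config.getSchema().getDataSchema().getTransformSpec().getTransforms().isEmpty()) {
  return fullSchema;
}
```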
The HadoopDruidIndexerConfig (which we already get from HadoopDruidIndexerConfig.fromConfiguration(context.getConfiguration())) does have the DataSchema spec, though, so we can get the transformSpec from it. I’m not sure how easy it is to get the source fields from the TransformSpec, however. Took a quick look, and maybe the RowFunction inside the Transform would need to expose a new method. In this new method, we would call analyzeInputs on the expr, getting back a BindingAnalysis. Then we can call getRequiredBindings. Maybe?
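Roughly what I had in mind, as a sketch (not tested; getRequiredColumns is a hypothetical new helper, and it only covers expression-based transforms, not columns referenced by the filter):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.druid.math.expr.Expr;
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.math.expr.Parser;
import org.apache.druid.segment.transform.ExpressionTransform;
import org.apache.druid.segment.transform.Transform;
import org.apache.druid.segment.transform.TransformSpec;

// Hypothetical helper: collect every input column referenced by the
// expressions in a TransformSpec, via Expr#analyzeInputs().
static Set<String> getRequiredColumns(TransformSpec transformSpec, ExprMacroTable macroTable)
{
  final Set<String> required = new HashSet<>();
  for (Transform transform : transformSpec.getTransforms()) {
    if (transform instanceof ExpressionTransform) {
      final Expr parsed = Parser.parse(((ExpressionTransform) transform).getExpression(), macroTable);
      // BindingAnalysis knows every identifier the parsed expression reads
      required.addAll(parsed.analyzeInputs().getRequiredBindings());
    }
  }
  return required;
}
```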
c
> I took a look at the parquet inputFormat and it doesn’t have this problem, because it just reads all the columns of the parquet file.
oh, that’s kind of sad
it does seem useful for the hadoop reader and inputFormat both to be able to know the full set of required columns that are needed
need to look at some stuff to see if there is an obvious way to do this
m
> it does seem useful for the hadoop reader and inputFormat both to be able to know the full set of required columns that are needed
I agree. However, I think the most important thing is to fix hadoop ingestion of parquet files for the case described above. If it is not too much work, we can do the right fix (correctly getting the required columns) straight away.
lmk if you have a chance to look at RowFunction/BindingAnalysis in the Transform
Also, I think the flattenSpec is applied after reading from parquet
so the columns used in the flattenSpec (if not part of the dimensionSpec) must also be added to the set of required columns
oh actually, we currently return fullSchema if there is a flattenSpec.
we can think about whether we want to optimize this too
```java
// this is kind of lame, maybe we can still trim what we read if we
// parse the flatten spec and determine it isn't auto discovering props?
if (parseSpec instanceof ParquetParseSpec) {
  if (((ParquetParseSpec) parseSpec).getFlattenSpec() != null) {
    return fullSchema;
  }
}
```
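If we do optimize it, maybe something along these lines. A sketch assuming the flattenSpec here is a JSONPathSpec; flattenSpecColumns and extractRootColumn are hypothetical helpers, and the root-column extraction is deliberately naive:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.druid.java.util.common.parsers.JSONPathFieldSpec;
import org.apache.druid.java.util.common.parsers.JSONPathSpec;

// Hypothetical helper: the root parquet columns a flattenSpec reads, or
// null to signal that field discovery is on and we must read everything.
static Set<String> flattenSpecColumns(JSONPathSpec flattenSpec)
{
  if (flattenSpec.isUseFieldDiscovery()) {
    return null; // caller should fall back to fullSchema
  }
  final Set<String> columns = new HashSet<>();
  for (JSONPathFieldSpec field : flattenSpec.getFields()) {
    columns.add(extractRootColumn(field.getExpr()));
  }
  return columns;
}

// Deliberately naive: "$.foo.bar" -> "foo". Real JSONPath parsing would
// also need to handle bracket notation, array indexes, filters, etc.
static String extractRootColumn(String jsonPath)
{
  final String stripped = jsonPath.startsWith("$.") ? jsonPath.substring(2) : jsonPath;
  final int dot = stripped.indexOf('.');
  return dot < 0 ? stripped : stripped.substring(0, dot);
}
```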
I put up a fix for this at https://github.com/apache/druid/pull/13612, in case anyone is interested