# dev
s
c
will check this out in a bit
For instance, it’s possible for an InputFormat to natively populate the timestamp field of an InputRow without creating a separate “column” for it. However, it then becomes impossible to reference that in a timestamp spec.
assuming all input formats these days use the map based input row, it seems not very cool that stuff would leave the source time column out of the raw event
it makes sense to me that the time column would be excluded from the dimensions list, but not being available via getRaw seems incorrect
naively it seems like __time shouldn’t be in the raw event map unless there was in fact a column in the underlying input with that name
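roughly what i mean, as a toy sketch (a made-up stand-in class, not Druid’s real MapBasedInputRow): the parsed time lives in its own field, and getRaw just reads the untouched source map, so `__time` only appears if the input actually had a column with that name

```java
import java.util.List;
import java.util.Map;

// Hedged sketch, not Druid's actual classes: a minimal stand-in for a
// map-based input row. getRaw exposes exactly the source event's columns,
// so "__time" only shows up if the input really had a column named that.
public class RawEventDemo {
    static class SketchRow {
        private final long timestampMillis;      // parsed __time, held separately
        private final List<String> dimensions;   // source time column already excluded
        private final Map<String, Object> event; // the raw source event, untouched

        SketchRow(long timestampMillis, List<String> dimensions, Map<String, Object> event) {
            this.timestampMillis = timestampMillis;
            this.dimensions = dimensions;
            this.event = event;
        }

        long getTimestampFromEpoch() { return timestampMillis; }
        List<String> getDimensions() { return dimensions; }
        // reads straight from the source map: no synthetic __time entry injected
        Object getRaw(String column) { return event.get(column); }
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of("ts", "2023-01-01T00:00:00Z", "page", "home");
        SketchRow row = new SketchRow(1672531200000L, List.of("page"), event);
        System.out.println(row.getRaw("ts"));     // source time column still visible
        System.out.println(row.getRaw("__time")); // null: input had no such column
    }
}
```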
ive been discussing with eric a bit and we’ve been wondering if we should just dump `InputRow` and start dealing with plain Map directly, since `InputRow` seems to just increase the chances of making mistakes
then dimensions and exclusions and everything else are a concern for the thing processing the map instead of something we push down into the row
but thats a bit of a task
i think the input time column eventually needs to be excluded from the ‘dimensions’ list of the row spit out by `getDimensions`
or else it will end up becoming part of the row key iirc which would interfere with rollup (i could be misremembering)
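the rollup interference looks something like this toy sketch (assumed key shape, not the real IncrementalIndex key): the rollup key is roughly bucketed __time plus dimension values, so a high-cardinality source timestamp leaked into dimensions makes almost every key unique

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of why a leaked time column would interfere with rollup:
// if the rollup key is roughly (bucketed __time + dimension values), a raw,
// high-cardinality source timestamp left in the dimensions list makes almost
// every key unique, so nothing collapses. Names here are illustrative.
public class RollupKeySketch {
    static List<Object> rollupKey(long bucketedTime, List<String> dims, Map<String, Object> event) {
        List<Object> key = new ArrayList<>();
        key.add(bucketedTime);
        for (String d : dims) {
            key.add(event.get(d));
        }
        return key;
    }

    public static void main(String[] args) {
        Map<String, Object> a = Map.of("ts", "2023-01-01T00:00:01Z", "page", "home");
        Map<String, Object> b = Map.of("ts", "2023-01-01T00:00:02Z", "page", "home");
        long hourBucket = 1672531200000L; // both events truncate to the same hour

        // with "ts" properly excluded, the two rows share a key and roll up
        System.out.println(rollupKey(hourBucket, List.of("page"), a)
                .equals(rollupKey(hourBucket, List.of("page"), b))); // true

        // with "ts" leaked into dimensions, the keys differ and rollup is lost
        System.out.println(rollupKey(hourBucket, List.of("page", "ts"), a)
                .equals(rollupKey(hourBucket, List.of("page", "ts"), b))); // false
    }
}
```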
x
I think it’s not clear which columns should be in the input row.
c
yea
for map based row i think like, all of them probably? which i think is part of why we were discussing if InputRow is really that useful
x
Even with a map it’s not clear what’s in dimensions
It doesn’t always match the underlying map keys
c
they should all be available via getRaw at least
i agree its confusing what getDimension, getDimensions, and getMetric are supposed to do
which sort of relies on something externally populating the list of metrics which should be excluded as well as the timestamp input column
x
The role of the time stamp is not clear either
c
where presumably the exclusions are the metrics
i think row.getTimestamp is supposed to be the way to get the __time column value for the row, regardless of what input column it came from via the timestamp spec and doing any translation necessary
since like the input column might have been a string or whatever else and isn’t likely consumable as a datetime directly
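conceptually something like this sketch (the column name “ts” and the format handling are assumptions, the real TimestampSpec supports iso, millis, posix, custom patterns, etc):

```java
import java.time.Instant;
import java.util.Map;

// Hedged sketch of what a timestamp spec conceptually does: pick a named
// input column ("ts" is an assumed name) and translate its raw value --
// often a string or a number -- into an actual instant. Druid's real
// TimestampSpec supports many more formats than this toy version.
public class TimestampSpecSketch {
    static Instant extractTimestamp(Map<String, Object> event, String timestampColumn) {
        Object raw = event.get(timestampColumn);
        if (raw instanceof Number) {
            return Instant.ofEpochMilli(((Number) raw).longValue());
        }
        // fall back to ISO-8601 strings for this sketch
        return Instant.parse(raw.toString());
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of("ts", "2023-06-01T12:00:00Z", "page", "home");
        System.out.println(extractTimestamp(event, "ts").toEpochMilli()); // 1685620800000
    }
}
```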
the more we discuss this i think the more in favor i am of just dropping inputRow and making IncrementalIndex do all the work 😅
then the map is just the plain java type representation of what the input reader spits out
for the kafka format it would wrap/decorate the payloads map
x
Right, for transforms we have to hoist it back into a time column
c
and transforms wrap that
and then incremental index can read stuff out however it wants and just treat the leftovers of aggs/transforms that are agg inputs/timestamp as the discovered dimensions
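the “leftovers” idea as a toy sketch (all names invented, not the real discovery logic): dimensions are just whatever keys remain after removing the timestamp input column and the aggregator inputs

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch of schema discovery as "leftovers": discovered dimensions
// are whatever keys remain in the raw event after removing the timestamp
// input column and any columns consumed as aggregator inputs. Names made up.
public class DiscoverDimensions {
    static List<String> discover(Map<String, Object> event, String timeColumn, Set<String> aggInputs) {
        List<String> dims = new ArrayList<>();
        for (String key : event.keySet()) {
            if (!key.equals(timeColumn) && !aggInputs.contains(key)) {
                dims.add(key);
            }
        }
        return dims;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the source column order deterministic
        Map<String, Object> event = new java.util.LinkedHashMap<>();
        event.put("ts", "2023-01-01T00:00:00Z");
        event.put("page", "home");
        event.put("latency", 12L);
        System.out.println(discover(event, "ts", Set.of("latency"))); // [page]
    }
}
```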
i have a fear that the old hadoop parsers might make this easier said than done though
i wish it could just .. go away
x
I think the api contract for how input formats should generate input rows needs to be specified
c
yeah that might be a good solution for now
i think most direct input formats use `MapInputRowParser.parse`
which is relatively straightforward i guess, the complication is for things that delegate like the kafka format and transforms
my recent fixes to the kafka format just re-use the methods from that, which makes it consistent
but it is easy to make mistakes i think since it isn’t well defined
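the delegation pattern is roughly this toy sketch (field names like “kafka.key” are invented): a wrapping format layers its extra fields into the payload’s map and then hands everything to the one shared parse path, so there’s no second code path to drift

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the delegation pattern: a wrapping (kafka-style) format
// merges its extra fields into the payload's map, then reuses the same
// shared parse step the direct formats use (a la MapInputRowParser.parse),
// keeping dimension/timestamp handling consistent. Field names are invented.
public class WrappingFormatSketch {
    // stand-in for the shared parse helper; identity is enough for the sketch
    static Map<String, Object> parse(Map<String, Object> event) {
        return event;
    }

    static Map<String, Object> parseWrapped(Map<String, Object> payload, Map<String, Object> kafkaFields) {
        Map<String, Object> merged = new LinkedHashMap<>(payload);
        merged.putAll(kafkaFields); // kafka key/headers/timestamp layered on top
        return parse(merged);       // reuse the one parse path, no second code path
    }

    public static void main(String[] args) {
        Map<String, Object> payload = Map.of("page", "home");
        Map<String, Object> kafka = Map.of("kafka.key", "k1");
        System.out.println(parseWrapped(payload, kafka).keySet()); // [page, kafka.key]
    }
}
```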
x
Part of the complication is that we pass through exclusions to optimize the parsing
c
yeah that’s true
i can’t remember if we are actually doing that for the native input formats like the older parsers were doing, particularly for things like parquet and orc where it can make a pretty big difference
x
And some formats have native timestamps in which case the timestampspec is redundant
c
yeah, i guess i think of the timestamp spec as a mapping from some input row to __time maybe doing a conversion… which sounds a lot like a transform when you say it like that
which i guess in newer MSQ based ingestion stuff you do use a transform to emulate the behavior of the timestamp spec
x
right, the druid input format natively has a timestamp so it has to be hoisted into the data somehow