# dev
s
c
will check this out in a bit
For instance, it’s possible for an InputFormat to natively populate the timestamp field of an InputRow without creating a separate “column” for it. However, it then becomes impossible to reference that in a timestamp spec.
assuming all input formats these days use the map based input row, it seems not very cool that stuff would leave the source time column out of the raw event
it makes sense to me that the time column would be excluded from the dimensions list, but not being available via getRaw seems incorrect
naively it seems like __time shouldn’t be in the raw event map unless there was in fact a column in the underlying input with that name
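roughly what i mean, as a toy sketch (a made-up stand-in class, not Druid’s real MapBasedInputRow): the parsed time lives in its own field, and getRaw just reads the untouched source map, so `__time` only appears if the input actually had a column with that name

```java
import java.util.List;
import java.util.Map;

// Hedged sketch, not Druid's actual classes: a minimal stand-in for a
// map-based input row. getRaw exposes exactly the source event's columns,
// so "__time" only shows up if the input really had a column named that.
public class RawEventDemo {
    static class SketchRow {
        private final long timestampMillis;      // parsed __time, held separately
        private final List<String> dimensions;   // source time column already excluded
        private final Map<String, Object> event; // the raw source event, untouched

        SketchRow(long timestampMillis, List<String> dimensions, Map<String, Object> event) {
            this.timestampMillis = timestampMillis;
            this.dimensions = dimensions;
            this.event = event;
        }

        long getTimestampFromEpoch() { return timestampMillis; }
        List<String> getDimensions() { return dimensions; }
        // reads straight from the source map: no synthetic __time entry injected
        Object getRaw(String column) { return event.get(column); }
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of("ts", "2023-01-01T00:00:00Z", "page", "home");
        SketchRow row = new SketchRow(1672531200000L, List.of("page"), event);
        System.out.println(row.getRaw("ts"));     // source time column still visible
        System.out.println(row.getRaw("__time")); // null: input had no such column
    }
}
```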
ive been discussing with eric a bit and we’ve been wondering if we should just dump `InputRow` and start dealing with plain Map directly, since `InputRow` seems to just increase the chances of making mistakes
then dimensions and exclusions and everything else are a concern for the thing processing the map instead of something we push down into the row
but thats a bit of a task
i think the input time column eventually needs to be excluded from the ‘dimensions’ list of the row spit out by `getDimensions`
or else it will end up becoming part of the row key iirc which would interfere with rollup (i could be misremembering)
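the rollup interference looks something like this toy sketch (assumed key shape, not the real IncrementalIndex key): the rollup key is roughly bucketed __time plus dimension values, so a high-cardinality source timestamp leaked into dimensions makes almost every key unique

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of why a leaked time column would interfere with rollup:
// if the rollup key is roughly (bucketed __time + dimension values), a raw,
// high-cardinality source timestamp left in the dimensions list makes almost
// every key unique, so nothing collapses. Names here are illustrative.
public class RollupKeySketch {
    static List<Object> rollupKey(long bucketedTime, List<String> dims, Map<String, Object> event) {
        List<Object> key = new ArrayList<>();
        key.add(bucketedTime);
        for (String d : dims) {
            key.add(event.get(d));
        }
        return key;
    }

    public static void main(String[] args) {
        Map<String, Object> a = Map.of("ts", "2023-01-01T00:00:01Z", "page", "home");
        Map<String, Object> b = Map.of("ts", "2023-01-01T00:00:02Z", "page", "home");
        long hourBucket = 1672531200000L; // both events truncate to the same hour

        // with "ts" properly excluded, the two rows share a key and roll up
        System.out.println(rollupKey(hourBucket, List.of("page"), a)
                .equals(rollupKey(hourBucket, List.of("page"), b))); // true

        // with "ts" leaked into dimensions, the keys differ and rollup is lost
        System.out.println(rollupKey(hourBucket, List.of("page", "ts"), a)
                .equals(rollupKey(hourBucket, List.of("page", "ts"), b))); // false
    }
}
```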
x
I think it’s not clear which columns should be in the input row.
c
yea
for map based row i think like, all of them probably? which i think is part of why we were discussing if InputRow is really that useful
x
Even with a map it’s not clear what’s in dimensions
It doesn’t always match the underlying map keys
c
they should all be available via getRaw at least
i agree its confusing what getDimension, getDimensions, and getMetric are supposed to do
which sort of relies on something externally populating the list of metrics which should be excluded as well as the timestamp input column
x
The role of the time stamp is not clear either
c
where presumably the exclusions are the metrics
i think row.getTimestamp is supposed to be the way to get the __time column value for the row, regardless of what input column it came from via the timestamp spec and doing any translation necessary
since like the input column might have been a string or whatever else and isn’t likely consumable as a datetime directly
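conceptually something like this sketch (the column name “ts” and the format handling are assumptions, the real TimestampSpec supports iso, millis, posix, custom patterns, etc):

```java
import java.time.Instant;
import java.util.Map;

// Hedged sketch of what a timestamp spec conceptually does: pick a named
// input column ("ts" is an assumed name) and translate its raw value --
// often a string or a number -- into an actual instant. Druid's real
// TimestampSpec supports many more formats than this toy version.
public class TimestampSpecSketch {
    static Instant extractTimestamp(Map<String, Object> event, String timestampColumn) {
        Object raw = event.get(timestampColumn);
        if (raw instanceof Number) {
            return Instant.ofEpochMilli(((Number) raw).longValue());
        }
        // fall back to ISO-8601 strings for this sketch
        return Instant.parse(raw.toString());
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of("ts", "2023-06-01T12:00:00Z", "page", "home");
        System.out.println(extractTimestamp(event, "ts").toEpochMilli()); // 1685620800000
    }
}
```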
the more we discuss this i think the more in favor i am of just dropping inputRow and making IncrementalIndex do all the work 😅
then the map is just the plain java type representation of what the input reader spits out
for the kafka format it would wrap/decorate the payloads map
x
Right, for transforms we have to hoist it back into a time column
c
and transforms wrap that
and then incremental index can read stuff out however it wants and just treat the leftovers of aggs/transforms that are agg inputs/timestamp as the discovered dimensions
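the “leftovers” idea as a toy sketch (all names invented, not the real discovery logic): dimensions are just whatever keys remain after removing the timestamp input column and the aggregator inputs

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch of schema discovery as "leftovers": discovered dimensions
// are whatever keys remain in the raw event after removing the timestamp
// input column and any columns consumed as aggregator inputs. Names made up.
public class DiscoverDimensions {
    static List<String> discover(Map<String, Object> event, String timeColumn, Set<String> aggInputs) {
        List<String> dims = new ArrayList<>();
        for (String key : event.keySet()) {
            if (!key.equals(timeColumn) && !aggInputs.contains(key)) {
                dims.add(key);
            }
        }
        return dims;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the source column order deterministic
        Map<String, Object> event = new java.util.LinkedHashMap<>();
        event.put("ts", "2023-01-01T00:00:00Z");
        event.put("page", "home");
        event.put("latency", 12L);
        System.out.println(discover(event, "ts", Set.of("latency"))); // [page]
    }
}
```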
i have a fear that the old hadoop parsers might make this easier said than done though
i wish it could just .. go away
x
I think the api contract for how input formats should generate input rows needs to be specified
c
yeah that might be a good solution for now
i think most direct input formats use `MapInputRowParser.parse`
which is relatively straightforward i guess, the complication is for things that delegate like the kafka format and transforms
my recent fixes to the kafka format just re-use the methods from that, which makes it consistent
but it is easy to make mistakes i think since it isn’t well defined
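the delegation pattern is roughly this toy sketch (field names like “kafka.key” are invented): a wrapping format layers its extra fields into the payload’s map and then hands everything to the one shared parse path, so there’s no second code path to drift

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the delegation pattern: a wrapping (kafka-style) format
// merges its extra fields into the payload's map, then reuses the same
// shared parse step the direct formats use (a la MapInputRowParser.parse),
// keeping dimension/timestamp handling consistent. Field names are invented.
public class WrappingFormatSketch {
    // stand-in for the shared parse helper; identity is enough for the sketch
    static Map<String, Object> parse(Map<String, Object> event) {
        return event;
    }

    static Map<String, Object> parseWrapped(Map<String, Object> payload, Map<String, Object> kafkaFields) {
        Map<String, Object> merged = new LinkedHashMap<>(payload);
        merged.putAll(kafkaFields); // kafka key/headers/timestamp layered on top
        return parse(merged);       // reuse the one parse path, no second code path
    }

    public static void main(String[] args) {
        Map<String, Object> payload = Map.of("page", "home");
        Map<String, Object> kafka = Map.of("kafka.key", "k1");
        System.out.println(parseWrapped(payload, kafka).keySet()); // [page, kafka.key]
    }
}
```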
x
Part of the complication is that we pass through exclusions to optimize the parsing
c
yeah that’s true
i can’t remember if we are actually doing that for the native input formats like the older parsers were doing, particularly for things like parquet and orc where it can make a pretty big difference
x
And some formats have native timestamps in which case the timestampspec is redundant
c
yeah, i guess i think of the timestamp spec as a mapping from some input row to __time maybe doing a conversion… which sounds a lot like a transform when you say it like that
which i guess in newer MSQ based ingestion stuff you do use a transform to emulate the behavior of the timestamp spec
x
right, the druid input format natively has a timestamp so it has to be hoisted into the data somehow