# troubleshooting
s
So we have data that contains an event_id that is obviously very high cardinality, a unique value per row. The thing is, we don't really have any use for the field other than for audit purposes, so we're considering getting rid of it. But let's say we did keep it and we employ the Pinot managed offline flows to roll things up. Can you ignore a field during rollup, so that you keep the detail for a time, but as you move to offline you default it to something so that it basically gets rolled up?
k
What’s the concern that makes you want to drop or default that field? If it’s size, you can have a field with no index/dictionary, in which case it won’t take up much space…
s
it's just a high cardinality field that I'm arguing we don't need to have in pinot. But in our discussions some people feel we need it for auditing.
k
And the issue with high cardinality is?
s
space
and during our rollup to offline we would have to keep that field right? So keeping a lot more data than we really need.
We'd basically get no gain from the rollup
k
I would suggest defining a schema/table where that field has no dictionary or index, and then see if you save a significant amount of space (versus removing it).
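A minimal sketch of what that table config fragment might look like, written as a Python dict and dumped to JSON. It assumes the column is listed under `noDictionaryColumns` in `tableIndexConfig` and is simply left out of any index column lists; the rest of the table config is omitted here, so treat this as a fragment to merge into a real config and compare segment sizes, not a complete definition.

```python
import json

# Fragment of a Pinot table config: store event_id raw (no dictionary).
# Leaving event_id out of the index column lists means no index is built for it.
# Everything else a real table config needs is intentionally omitted.
table_config_fragment = {
    "tableIndexConfig": {
        "noDictionaryColumns": ["event_id"],
    }
}

print(json.dumps(table_config_fragment, indent=2))
```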
s
well what about the rollup component?
So our thought is if we remove that field, we ingest the same amount of data into pinot in terms of rows, but as we move to offline and rollup, we will aggregate the metrics across the common dimensions.
If we have event_id in there, we won't have a good rollup rate, b/c event_id is unique for every row
And I should mention we don't have a use case for filtering or grouping by event_id
It would only be there for audit purposes, which to me seems like a weak reason to have it
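To make the rollup-rate point concrete, here is a small plain-Python sketch (not Pinot code) that sums a metric over rows sharing the same dimension values. The column names `country`, `device`, and `clicks` and the sample rows are made up for illustration. With the unique event_id among the dimensions nothing merges; without it the rows collapse.

```python
from collections import defaultdict

def rollup(rows, dimensions, metric):
    """Sum `metric` over rows that share the same values for `dimensions`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[metric]
    return [dict(zip(dimensions, key), **{metric: total})
            for key, total in totals.items()]

# Hypothetical sample data: event_id is unique per row.
rows = [
    {"event_id": 1, "country": "US", "device": "ios", "clicks": 3},
    {"event_id": 2, "country": "US", "device": "ios", "clicks": 5},
    {"event_id": 3, "country": "DE", "device": "web", "clicks": 2},
    {"event_id": 4, "country": "US", "device": "ios", "clicks": 1},
]

with_id = rollup(rows, ["event_id", "country", "device"], "clicks")
without_id = rollup(rows, ["country", "device"], "clicks")

print(len(with_id), len(without_id))
```

Here the rollup that includes event_id keeps all four rows, while the one without it keeps two.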
k
You are thinking about it the right way. You can remove event_id or make it null in the rollup.
s
ohhh so the rollup would allow me to null as I move the data?
k
I don’t know about that
s
ok
I'll play w/ rollup next to see
k
But should be easy to add a transform function
k
OK, got it - I took a quick walk through the code, and row equality during a rollup (reduce) is based on “sort fields.”
s
what my options are there
So as I roll up, it sounds like I could transform a field to some default
k
Yeah
But please confirm
s
I def will and thanks!
k
Design wise I don’t see any big change adding a transform on a column
k
@Jackie - what defines equality between rows when doing a rollup?
s
If this works as I think it might I could have the best of both worlds maybe. A certain amount of my data is in realtime table w/ the event_id for potential auditing, then as I move to offline table I default event_id to 0 and get good rollup.
Let me know if that sounds reasonable
j
@Stuart Millholland It is absolutely reasonable. We don't support it currently, but it is doable. Essentially we need to add a new task config to skip some columns when running the task in ROLLUP or DEDUP mode. Internally we will fill default values to these columns so that they won't be considered. Can you please help file a github issue describing the requirements?
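A rough plain-Python sketch of the idea described above, which could help frame the GitHub issue. This is not Pinot code and the feature does not exist yet; the function name `rollup_skipping`, the `skip_defaults` argument, and the sample rows are all hypothetical. Skipped columns are overwritten with a default before the rollup key is built, so they no longer keep rows apart.

```python
from collections import defaultdict

def rollup_skipping(rows, dimensions, metric, skip_defaults):
    """Roll up `metric` after overwriting skipped columns with defaults.

    `skip_defaults` maps column name -> default value (hypothetical config).
    """
    merged = defaultdict(int)
    for row in rows:
        filled = {**row, **skip_defaults}  # defaults replace the real values
        key = tuple(filled[d] for d in dimensions)
        merged[key] += filled[metric]
    return [dict(zip(dimensions, key), **{metric: total})
            for key, total in merged.items()]

# Realtime rows keep the real event_id for auditing.
realtime_rows = [
    {"event_id": 101, "country": "US", "clicks": 3},
    {"event_id": 102, "country": "US", "clicks": 5},
    {"event_id": 103, "country": "DE", "clicks": 2},
]

# Rows bound for the offline table: event_id defaults to 0, so rows merge.
offline_rows = rollup_skipping(
    realtime_rows, ["event_id", "country"], "clicks", {"event_id": 0}
)
print(offline_rows)
```

The realtime rows are left untouched, which matches the "best of both worlds" split: detail with event_id in the realtime table, rolled-up rows with a default event_id in the offline table.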
s
yep, will do!
I'll get something out there by Monday at the latest
I'm happy to add whatever needs to be added there in terms of issue content/context