So we have data that contains an event id that is obviously Apache Pinot #troubleshooting

Stuart Millholland

06/10/2022, 5:45 PM

So we have data that contains an event_id that is obviously very high cardinality, a unique record per row. The thing is, we don't really have any use at all for the field other than for audit purposes, so we're considering getting rid of it. But let's say we did keep it and we employ the Pinot managed offline flows to roll things up. Can you ignore a field during rollup, so that you keep the detail for a time, but as you move to offline you can default it to something so that it basically gets rolled up?

Ken Krugler

06/10/2022, 6:27 PM

What’s the concern that makes you want to drop or default that field? If it’s size, you can have a field with no index/dictionary, in which case it won’t take up much space…

Stuart Millholland

06/10/2022, 6:28 PM

it's just a high cardinality field that I'm arguing we don't need to have in pinot. But in our discussions some people feel we need it for auditing.

Ken Krugler

06/10/2022, 6:29 PM

And the issue with high cardinality is?

Stuart Millholland

06/10/2022, 6:29 PM

space

Stuart Millholland

06/10/2022, 6:30 PM

and during our rollup to offline we would have to keep that field right? So keeping a lot more data than we really need.

Stuart Millholland

06/10/2022, 6:30 PM

We'd basically get no gain from the rollup

Ken Krugler

06/10/2022, 6:30 PM

I would suggest defining a schema/table where that field has no dictionary or index, and then see if you save a significant amount of space (versus removing it).
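For reference, a minimal sketch of the table-config fragment this suggestion implies, built as a Python dict and dumped to JSON. It assumes the `noDictionaryColumns` list under `tableIndexConfig`; the table name `events` is hypothetical, and the keys should be checked against the Pinot docs for the version in use.

```python
import json

# Hypothetical table name; event_id is the column discussed in the thread.
table_config_fragment = {
    "tableName": "events",
    "tableIndexConfig": {
        # Store event_id as raw values, so no dictionary is built for it.
        "noDictionaryColumns": ["event_id"],
        # event_id is deliberately left out of every index list.
        "invertedIndexColumns": [],
    },
}

print(json.dumps(table_config_fragment, indent=2))
```

Comparing segment sizes with and without this setting would show whether the space saving is significant.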

Stuart Millholland

06/10/2022, 6:31 PM

well what about the rollup component?

Stuart Millholland

06/10/2022, 6:31 PM

https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows

Stuart Millholland

06/10/2022, 6:32 PM

So our thought is if we remove that field, we ingest the same amount of data into pinot in terms of rows, but as we move to offline and rollup, we will aggregate the metrics across the common dimensions.
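A rough sketch of what the rollup side of a managed offline flow could look like in the realtime table config, again as a Python dict. The task name and the `mergeType` / `<metric>.aggregationType` keys follow the managed offline flows docs linked above as best I recall them; the metric name `clicks` and the time periods are hypothetical.

```python
import json

# Sketch of a task config fragment with rollup enabled.
# "clicks" and the bucket/buffer periods are made-up values.
task_config = {
    "task": {
        "taskTypeConfigsMap": {
            "RealtimeToOfflineSegmentsTask": {
                "bucketTimePeriod": "1d",
                "bufferTimePeriod": "2d",
                # Aggregate rows that share the same dimension values.
                "mergeType": "rollup",
                # How the hypothetical metric is combined during rollup.
                "clicks.aggregationType": "sum",
            }
        }
    }
}

print(json.dumps(task_config, indent=2))
```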

Stuart Millholland

06/10/2022, 6:33 PM

If we have event_id in there, we won't have a good rollup rate, b/c event_id is unique for every row
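To make the point concrete, here is a small illustration in plain Python (not Pinot code) that counts how many rows survive a rollup with and without event_id in the grouping key. The rows and column layout are made up for the example.

```python
# Hypothetical rows: (event_id, country, device, clicks)
rows = [
    (1, "US", "mobile", 3),
    (2, "US", "mobile", 5),
    (3, "US", "desktop", 2),
    (4, "DE", "mobile", 1),
    (5, "US", "mobile", 4),
]

def rows_after_rollup(rows, key_indexes):
    """Count the distinct dimension keys, i.e. the rows left after rollup."""
    return len({tuple(row[i] for i in key_indexes) for row in rows})

# event_id is unique per row, so including it prevents any merging.
with_event_id = rows_after_rollup(rows, key_indexes=(0, 1, 2))
# Grouping only on country and device lets identical keys collapse.
without_event_id = rows_after_rollup(rows, key_indexes=(1, 2))

print(with_event_id, without_event_id)
```

With event_id in the key, every input row stays a separate output row; without it, the five rows collapse to three.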

Stuart Millholland

06/10/2022, 6:33 PM

And I should mention we don't have a use case for filtering or grouping by event_id

Stuart Millholland

06/10/2022, 6:33 PM

It would only be there for audit purposes, which to me seems like a weak reason to have it

Kishore G

06/10/2022, 6:43 PM

You are thinking about it the right way.. you can remove event_id or make it null in the rollup

Stuart Millholland

06/10/2022, 6:43 PM

ohhh so the rollup would allow me to null as I move the data?

Kishore G

06/10/2022, 6:43 PM

I don’t know about that

Stuart Millholland

06/10/2022, 6:43 PM

I'll play w/ rollup next to see

Kishore G

06/10/2022, 6:43 PM

But should be easy to add a transform function

Ken Krugler

06/10/2022, 6:43 PM

OK, got it - I took a quick walk through the code, and row equality during a rollup (reduce) is based on “sort fields.”

Stuart Millholland

06/10/2022, 6:43 PM

what my options are there

Stuart Millholland

06/10/2022, 6:44 PM

so as I rollup I could transform a field to some default sounds like

Kishore G

06/10/2022, 6:44 PM

Yeah

Kishore G

06/10/2022, 6:44 PM

But please confirm

Stuart Millholland

06/10/2022, 6:44 PM

I def will and thanks!

Kishore G

06/10/2022, 6:45 PM

Design wise I don’t see any big change adding a transform on a column

Ken Krugler

06/10/2022, 6:45 PM

@Jackie - what defines equality between rows when doing a rollup?

Stuart Millholland

06/10/2022, 6:47 PM

If this works as I think it might I could have the best of both worlds maybe. A certain amount of my data is in realtime table w/ the event_id for potential auditing, then as I move to offline table I default event_id to 0 and get good rollup.

Stuart Millholland

06/10/2022, 6:48 PM

Let me know if that sounds reasonable

Jackie

06/10/2022, 7:01 PM

@Stuart Millholland It is absolutely reasonable. We don't support it currently, but it is doable. Essentially we need to add a new task config to skip some columns when running the task in ROLLUP/DEDUP mode. Internally we will fill these columns with default values so that they won't be considered. Can you please help file a GitHub issue describing the requirements?
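The idea described here (skip some columns by filling them with defaults before rows are compared) can be sketched in plain Python. This is only an illustration of the concept, not Pinot's implementation; per the thread, no such task config existed at the time. The `rollup` function, its `skip_defaults` parameter, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def rollup(rows, dimensions, metrics, skip_defaults):
    """Sum metrics per dimension key after overwriting skipped columns.

    skip_defaults maps a column name to the default value it is filled
    with, so that column no longer distinguishes one row from another.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Fill the skipped columns with their defaults before keying.
        filled = {**row, **skip_defaults}
        key = tuple(filled[d] for d in dimensions)
        for m in metrics:
            totals[key][m] += filled[m]
    return [
        {**dict(zip(dimensions, key)), **dict(sums)}
        for key, sums in totals.items()
    ]

rows = [
    {"event_id": 101, "country": "US", "clicks": 3},
    {"event_id": 102, "country": "US", "clicks": 5},
    {"event_id": 103, "country": "DE", "clicks": 1},
]

result = rollup(rows, ["event_id", "country"], ["clicks"], {"event_id": 0})
print(result)
```

Passing an empty `skip_defaults` leaves event_id untouched, in which case every row keeps its own key and nothing is merged.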

Stuart Millholland

06/10/2022, 7:01 PM

yep, will do!

Stuart Millholland

06/10/2022, 7:15 PM

I'll get something out there by Monday at the latest

Stuart Millholland

06/13/2022, 6:37 PM

https://github.com/apache/pinot/issues/8886

Stuart Millholland

06/13/2022, 6:38 PM

I'm happy to add whatever needs to be added there in terms of issue content/context
