# troubleshooting

Lars-Kristian Svenøy

04/28/2022, 8:49 AM
Hello everyone. Thanks again for all your help with everything so far. I have a question regarding upsert and how to deal with deduping for a certain scenario; details in thread.
So we have multiple tables where, whenever new data comes in, it defines the entire state of that entity. Unfortunately, because each incoming event is parsed into a 1:many relationship, we have no reliable way of deduping: if we specify a primary key down to the granularity of that relationship, we end up retaining data that was not provided in the new event. I'm not entirely sure how to deal with this issue, but one potential solution seems to be to use a JSON column type, allowing us to store the many side of the relationship in a single row. Unfortunately, there are limitations with the JSON type which make that infeasible. My question is whether it is possible to achieve this at all using an upsert table, or whether I would need a custom batch job. I had a thought that perhaps it would be possible to store the event with the composite key in Kafka, but extract the JSON payload during realtime ingestion. I'm assuming this doesn't work; any thoughts?
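For reference, the JSON-column approach described above would look roughly like this in Pinot. This is a sketch, not a confirmed working setup: the table name `entityEvents`, the column names, and the time column are all hypothetical; it assumes a composite primary key of `idPart1` + `idPart2`, with the sub-events serialized into a single JSON column so each new event replaces the entity's full state.

```json
// Hypothetical Pinot schema: primary key on the parent entity only,
// sub-events stored as one JSON payload per row.
{
  "schemaName": "entityEvents",
  "dimensionFieldSpecs": [
    { "name": "idPart1", "dataType": "STRING" },
    { "name": "idPart2", "dataType": "STRING" },
    { "name": "subEvents", "dataType": "JSON" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "eventTime",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": ["idPart1", "idPart2"]
}
```

The corresponding realtime table config would enable full upserts with `"upsertConfig": { "mode": "FULL" }`, so the latest record per primary key wins.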

Mayank

04/28/2022, 4:55 PM
Trying to understand: is the requirement that, with each new event for the same key, we need to append to the JSON payload (or attributes)? cc: @User

Lars-Kristian Svenøy

04/29/2022, 8:56 AM
Not exactly. I was wondering if it would be possible to track 1:many relationships somehow, and update the upsert references for all of them
It's probably not feasible in any case
Let's say I have an event:
```
idPart1
idPart2
List<SubEvents>
```
That gets split into multiple rows:
```
idPart1
idPart2
SubEvent
```
I want, as part of upsert, to somehow know that all of those old documents should no longer be referenced.
The problem with trying to use upsert today is that these 1:many relationships mean only updates are captured, not removals of information.
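The stale-row problem above can be sketched in a few lines. This is a minimal simulation (plain Python, not Pinot) of upsert semantics, assuming hypothetical ids `id1`/`id2` and sub-events `a`/`b`/`c`: keying each sub-event row retains rows the new event no longer contains, while keying the parent entity and storing sub-events as one JSON payload replaces state wholesale.

```python
import json

def upsert(store, key, value):
    """Upsert semantics: the latest record for a key wins."""
    store[key] = value

# --- Approach 1: one row per (idPart1, idPart2, subEvent) ---
rows = {}
for sub in ["a", "b", "c"]:            # first event: sub-events a, b, c
    upsert(rows, ("id1", "id2", sub), {"subEvent": sub})
for sub in ["a", "b"]:                 # second event: only a and b remain
    upsert(rows, ("id1", "id2", sub), {"subEvent": sub})
# The row for "c" is never overwritten, so the removal is not captured.

# --- Approach 2: one row per (idPart1, idPart2), sub-events as JSON ---
docs = {}
upsert(docs, ("id1", "id2"), json.dumps(["a", "b", "c"]))
upsert(docs, ("id1", "id2"), json.dumps(["a", "b"]))
# The second event fully replaces the first, so "c" is gone.

print(len(rows))                         # 3: includes the stale "c" row
print(json.loads(docs[("id1", "id2")]))  # ['a', 'b']
```

In approach 1 the stale `("id1", "id2", "c")` row survives forever, which is exactly the retained-data problem described above; approach 2 trades queryability of sub-events for correct full-state replacement.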
This can be solved using batch ingestion, but it's quite a bit of work to get batch ingestion set up