# getting-started
p
I have a question about this statement:
> DANGER
> The main thing to keep in mind when combining these features is that upsert functionality only applies to real-time tables.
> As soon as those segments are moved to an offline table, the upsert logic is no longer applied at query time. We will need to backfill the offline segments created by the real-time to offline job to achieve upsert-like queries.
What does
> upsert functionality only applies to real-time tables.
mean? If an update record arrives after the corresponding record has been moved to OFFLINE, will the updated record be treated as an insert and stored in the REALTIME table, or dropped? If it is treated as a new record, then we would have a duplicate old record in OFFLINE.
m
good question, let me check. That bit of the documentation that you quoted was describing the situation where you have, say:
timeStamp: 1, id: 1, val2: 3
timeStamp: 2, id: 1, val2: 4
timeStamp: 13, id: 1, val2: 10
While all those records are in the real-time table there’s a map that keeps track of which document should be returned when we ask for id=1. So in this case it’d only return this record:
timeStamp: 13, id: 1, val2: 10
As soon as the segment is moved offline, all three records would become visible
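A minimal sketch of that query-time behaviour, assuming a hypothetical upsert-enabled realtime table called `events` with `id` as its primary key (names borrowed from the example above, not from anyone's real setup):

```sql
-- While all three versions of id=1 are still in the realtime table, the
-- upsert metadata map makes only the most recent one visible at query time.
SELECT id, timeStamp, val2
FROM events
WHERE id = 1;
-- Per the explanation above, this returns a single row: timeStamp=13, val2=10
```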
p
Thanks Mark
> As soon as the segment is moved offline, all three records would become visible
So the move to offline does not handle reconciling the upsert? As in, it would not just move
timeStamp: 13, id: 1, val2: 10
to the offline table?
m
correct
the segment itself always has all the records, it’s just that they only become visible when you move it to offline
we should probably make a github issue to handle that
because it’s surprising behaviour
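To make the "segment itself always has all the records" point concrete: Pinot also exposes a `skipUpsert` query option that, as far as I understand it, bypasses the upsert metadata and shows the same rows you would see once the segment lands in the offline table. A sketch against the same hypothetical `events` table:

```sql
-- Bypass the upsert metadata and read every stored version of id=1.
-- Per the thread, this matches what the query returns after the
-- realtime-to-offline job has moved the segment.
SELECT id, timeStamp, val2
FROM events
WHERE id = 1
OPTION(skipUpsert=true);
-- Expected here: all three rows (timeStamp 1, 2 and 13), i.e. the older
-- versions of the record reappear.
```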
p
First thought at a workaround:
• Not have a realtime -> offline job.
• Have a spark/flink job that builds the offline segments from a different source that can handle the upsert reconciliation (sketched below).
• This still does not solve the issue of an update arriving after the original records have been removed from the realtime table after the retention period. This would cause a duplicate record to exist in the realtime table and the offline table (assuming a batch job updated the offline segment with the update).
  ◦ We would need to filter the updates that fall outside the realtime table's time window out of the ingested kafka topic.
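A rough sketch, in Spark SQL with made-up table and view names, of what the upsert reconciliation in such a batch job could look like: keep only the latest version of each key before building the offline segments.

```sql
-- Hypothetical raw source, e.g. the same Kafka topic landed in a data lake.
-- Keep only the newest version of each id, mirroring what the realtime
-- upsert would have returned, then build the offline segments from this view.
CREATE OR REPLACE TEMPORARY VIEW deduped_events AS
SELECT id, timeStamp, val2
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY timeStamp DESC) AS rn
  FROM raw_events
) versions
WHERE rn = 1;
-- Late updates whose keys have already aged out of the realtime retention
-- window would still need to be filtered separately, per the last bullet.
```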
@Eric Liu @Maitraiyee Gautam 👀
e
@Mark Needham It would be great if we could fix that issue! 🙌
@Phil Sheets
• not have a realtime -> offline job.
• after the original records have been removed from the realtime table after the retention period.
Unless the retention period is configured to match the schedule of the spark/flink job, both the offline and real-time tables might have records for the same keys, which would still cause the duplicate issue?
a
@Mark Needham is this issue fixed? We are exploring Pinot for a use case with frequent updates/upserts.