Phil Sheets
05/11/2023, 8:09 PM

> DANGER
> The main thing to keep in mind when combining these features is that upsert functionality only applies to real-time tables. As soon as those segments are moved to an offline table, the upsert logic is no longer applied at query time. We will need to backfill the offline segments created by the real-time to offline job to achieve upsert-like queries.

What does "upsert functionality only applies to real-time tables" mean? If an update record arrives after the corresponding record has been moved to OFFLINE, will the updated record be treated as an insert and stored in the REALTIME table, or dropped? If it is treated as a new record, then we would have a duplicate old record in OFFLINE.

Phil Sheets
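A toy model of the behaviour being asked about here, in plain Python (this is not Pinot code, and every record value apart from the quoted timeStamp: 13 row is made up): an upsert-enabled REALTIME table stores every version of a row but filters to the latest version per primary key at query time, while an OFFLINE table has no such filter, so every stored version is visible.

```python
# Toy model of query-time upserts (illustrative only, not Pinot APIs).
# The segment stores every ingested row; upsert metadata tracks which
# row is the latest per primary key and filters the rest at query time.

rows = [
    {"id": 1, "timeStamp": 11, "val2": 30},  # hypothetical older version
    {"id": 1, "timeStamp": 12, "val2": 20},  # hypothetical older version
    {"id": 1, "timeStamp": 13, "val2": 10},  # latest version of id=1
]

def query_with_upsert(rows):
    """Return only the latest row per primary key (highest timeStamp)."""
    latest = {}
    for row in rows:
        key = row["id"]
        if key not in latest or row["timeStamp"] > latest[key]["timeStamp"]:
            latest[key] = row
    return list(latest.values())

def query_without_upsert(rows):
    """No upsert metadata: every stored row is visible, as in OFFLINE."""
    return rows

print(len(query_with_upsert(rows)))     # 1 row visible via upsert
print(len(query_without_upsert(rows)))  # all 3 rows visible without it
```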
05/11/2023, 8:09 PM

Mark Needham
05/12/2023, 1:26 PM

timeStamp: 13 id: 1, val2: 10

As soon as the segment is moved offline, all three records would become visible.

Phil Sheets
05/12/2023, 1:27 PM

> As soon as the segment is moved offline, all three records would become visible

So the move to offline does not handle reconciling the upsert? As in, it would not just move "timeStamp: 13 id: 1, val2: 10" to the offline table?

Mark Needham
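A minimal sketch of what the backfill step would have to do, in plain Python with made-up record fields (not a real Spark/Flink job): collapse each primary key to its latest version, by the comparison column, before the rows are served from the offline table. This is the reconciliation that the real-time-to-offline move itself does not perform.

```python
from collections import defaultdict

def backfill_dedup(rows, key_col="id", comparison_col="timeStamp"):
    """Keep only the latest version of each primary key, mimicking what
    an upsert-aware query would have returned from the realtime table."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row)
    return [max(versions, key=lambda r: r[comparison_col])
            for versions in groups.values()]

segment = [
    {"id": 1, "timeStamp": 11, "val2": 30},  # hypothetical older versions
    {"id": 1, "timeStamp": 12, "val2": 20},
    {"id": 1, "timeStamp": 13, "val2": 10},  # the record quoted above
]
print(backfill_dedup(segment))  # only the timeStamp=13 row survives
```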
05/12/2023, 1:28 PM

Mark Needham
05/12/2023, 1:29 PM

Mark Needham
05/12/2023, 1:29 PM

Mark Needham
05/12/2023, 1:29 PM

Phil Sheets
05/12/2023, 1:38 PM

Phil Sheets
05/12/2023, 1:39 PM

Eric Liu
05/12/2023, 8:05 PM

Eric Liu
05/12/2023, 8:05 PM

• not have a realtime -> offline job, or
• after the original records have been removed from the realtime table by the retention period.

Unless the retention period is configured to match the schedule of the spark/flink job, both the offline and real-time tables might have records for the same keys, which would still cause the duplicate issue?
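Eric's overlap concern, sketched in plain Python with made-up timestamps and a naive union in place of Pinot's actual hybrid-table time-boundary logic: while the realtime table still retains rows that the job has already copied offline, the same key is visible in both places.

```python
def hybrid_query(realtime_rows, offline_rows, now, retention_seconds):
    """Toy union of both tables after realtime retention is applied.
    (Real Pinot splits a hybrid query at a time boundary; this simplified
    union only illustrates why overlapping windows risk duplicates.)"""
    retained = [r for r in realtime_rows
                if now - r["ingestTime"] < retention_seconds]
    return retained + offline_rows

realtime = [{"id": 1, "timeStamp": 13, "ingestTime": 100}]
offline  = [{"id": 1, "timeStamp": 13, "ingestTime": 100}]  # copied by the job

# Before retention expires: the same key appears twice (the duplicate issue).
print(len(hybrid_query(realtime, offline, now=150, retention_seconds=100)))  # 2
# After retention expires: only the offline copy remains.
print(len(hybrid_query(realtime, offline, now=250, retention_seconds=100)))  # 1
```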
Awadesh Kumar
12/05/2023, 8:03 AM