# getting-started
p
I have a question about this statement:
> DANGER
> The main thing to keep in mind when combining these features is that upsert functionality only applies to real-time tables.
> As soon as those segments are moved to an offline table, the upsert logic is no longer applied at query time. We will need to backfill the offline segments created by the real-time to offline job to achieve upsert-like queries.
What does
> upsert functionality only applies to real-time tables.
mean? If an update record arrives after the corresponding record has been moved to OFFLINE, will the updated record be treated as an insert and stored in the REALTIME table, or dropped? If it is treated as a new record, then we would have a duplicate old record in OFFLINE.
m
good question, let me check. That bit of the documentation that you quoted was describing the situation where you have, say:
timeStamp: 1, id: 1, val2: 3
timeStamp: 2, id: 1, val2: 4
timeStamp: 13, id: 1, val2: 10
While all those records are in the real-time table there’s a map that keeps track of which document should be returned when we ask for id=1. So in this case it’d only return this record:
timeStamp: 13, id: 1, val2: 10
As soon as the segment is moved offline, all three records would become visible
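A minimal sketch of that query-time behaviour, assuming a hypothetical upsert-enabled realtime table called `events` with `id` as its primary key (names borrowed from the example above, not from anyone's real setup):

```sql
-- While all three versions of id=1 are still in the realtime table, the
-- upsert metadata map makes only the most recent one visible at query time.
SELECT id, timeStamp, val2
FROM events
WHERE id = 1;
-- Per the explanation above, this returns a single row: timeStamp=13, val2=10
```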
p
Thanks Mark
> As soon as the segment is moved offline, all three records would become visible
So the move to offline does not handle reconciling the upsert? As in, it would not just move
timeStamp: 13, id: 1, val2: 10
to the offline table?
m
correct
the segment itself always has all the records, it’s just that they only become visible when you move it to offline
we should probably make a github issue to handle that
because it’s surprising behaviour
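To make the "segment itself always has all the records" point concrete: Pinot also exposes a `skipUpsert` query option that, as far as I understand it, bypasses the upsert metadata and shows the same rows you would see once the segment lands in the offline table. A sketch against the same hypothetical `events` table:

```sql
-- Bypass the upsert metadata and read every stored version of id=1.
-- Per the thread, this matches what the query returns after the
-- realtime-to-offline job has moved the segment.
SELECT id, timeStamp, val2
FROM events
WHERE id = 1
OPTION(skipUpsert=true);
-- Expected here: all three rows (timeStamp 1, 2 and 13), i.e. the older
-- versions of the record reappear.
```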
p
First thought at a workaround:
• Not have a realtime -> offline job.
• Have a spark/flink job that builds the offline segments from a different source that can handle the upsert reconciliation (sketched below).
• This still does not solve the issue of an update arriving after the original records have been removed from the realtime table after the retention period. This would cause a duplicate record to exist in the realtime table and the offline table (assuming a batch job updated the offline segment with the update).
  ◦ We would need to filter the updates that fall outside the realtime table's time window out of the ingested kafka topic.
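A rough sketch, in Spark SQL with made-up table and view names, of what the upsert reconciliation in such a batch job could look like: keep only the latest version of each key before building the offline segments.

```sql
-- Hypothetical raw source, e.g. the same Kafka topic landed in a data lake.
-- Keep only the newest version of each id, mirroring what the realtime
-- upsert would have returned, then build the offline segments from this view.
CREATE OR REPLACE TEMPORARY VIEW deduped_events AS
SELECT id, timeStamp, val2
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY timeStamp DESC) AS rn
  FROM raw_events
) versions
WHERE rn = 1;
-- Late updates whose keys have already aged out of the realtime retention
-- window would still need to be filtered separately, per the last bullet.
```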
@Eric Liu @Maitraiyee Gautam 👀
e
@Mark Needham It would be great if we could fix that issue! 🙌
@Phil Sheets
• not have a realtime -> offline job.
• after the original records have been removed from the realtime table after the retention period.
Unless the retention period is configured to match the schedule of the spark/flink job, both the offline and real-time tables might have records for the same keys, which would still cause the duplicate issue?
a
@Mark Needham is this issue fixed? We are exploring Pinot for a use case with frequent updates/upserts.