# general
j
Hello guys, I am looking at the Pinot upsert/dedup documentation. From what I understand, when upsert is enabled (let's say the PK is a string and a latest-ts column is used to determine order), at query time the latest row is returned by Pinot. But all the older rows are still stored by Pinot. Is that correct? Also, what is the difference between upsert and dedup? Is it that dedup will actually discard the older row data when a PK conflict is detected?
n
correct about upsert. for dedup, the newer row will be discarded, not the older one
j
yes makes sense. @Neha Pawar Does upsert return the newest row across realtime and offline segments?
n
Upsert is only applicable to realtime tables. So it will return the newest row across the consuming and completed segments, within a single partition
j
Thanks @Neha Pawar, that's perfect. I am gonna set up a table with both upsert and dedup enabled. So within a realtime segment, rows get deduped, and at query time the newest row from across consuming and completed segments gets returned. Does that make sense?
n
sounds like you could just use upsert? why do you need dedup?
• dedup also works across consuming + completed segments, so if you’ve enabled dedup, you will not really have more than 1 record for that primary key
• as of this time, dedup and upsert cannot be configured together, but some work will be done to change that (can't say when)
j
Ah so yes I only need dedup
I actually need just one row
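To make the difference discussed in this thread concrete, here is a minimal Python sketch of the two behaviours. It is a toy in-memory model, not Pinot code: the `Row` type and the `upsert_view` and `dedup_view` functions are hypothetical names made up for illustration, and the tie-breaking rule for equal timestamps is an assumption of the sketch. It only models a single partition, matching the scope mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    pk: str     # primary key
    ts: int     # comparison (timestamp) column
    value: str


def upsert_view(rows):
    """Toy upsert: every row stays stored; a query sees only the latest row per key."""
    stored = list(rows)  # older rows are still kept on disk
    visible = {}
    for row in stored:
        current = visible.get(row.pk)
        # Assumption of this sketch: on equal timestamps the later-ingested row wins.
        if current is None or row.ts >= current.ts:
            visible[row.pk] = row
    return stored, visible


def dedup_view(rows):
    """Toy dedup: the first row for a key is kept; later rows with that key are dropped at ingestion."""
    stored = []
    seen = set()
    for row in rows:
        if row.pk in seen:
            continue  # the newer duplicate is discarded, not the older row
        seen.add(row.pk)
        stored.append(row)
    return stored


# Rows in ingestion order; key "a" arrives twice.
rows = [Row("a", 1, "v1"), Row("b", 1, "v1"), Row("a", 2, "v2")]

stored, visible = upsert_view(rows)
print(len(stored), visible["a"].value)

deduped = dedup_view(rows)
print(len(deduped), deduped[0].value)
```

With upsert, all three rows remain stored and the query-time view of key `a` is the newer row. With dedup, only two rows are stored and key `a` keeps its first value, which is why dedup alone is enough when only one row per key is needed.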