#general

Vibhor Jain

10/18/2021, 12:05 PM
Hi Team, as part of handling duplicates in our hybrid table, we thought of using "mergeType": "dedup" when moving data from the realtime table to the offline table. The problem we are facing is that one of our columns stores an encrypted value, and this value changes every time, even for duplicate rows. Is there a way to perform "dedup" on a subset of columns when moving data to the offline table via minion?

Mayank

10/18/2021, 2:52 PM
Won’t that cause data loss due to incorrect dedup?
👍 1

Vibhor Jain

10/18/2021, 3:26 PM
Hi @User, by a subset of columns I mean only the primary key columns. Currently, the "mergeType": "dedup" config compares the entire row. Is there any option to restrict it to the primary key columns?

Mayank

10/18/2021, 4:35 PM
There isn't one right now, afaik. But I am still unclear. Let's say you have two rows with the same primary key values but different values in other dimensions. Which one do you expect the dedup to drop?

Vibhor Jain

10/19/2021, 4:36 AM
Hi @User, sorry for the confusion. Let me explain it step by step:
1. The source emits duplicate packets (say the ID col is the primary key).
2. We have a hybrid table.
3. The realtime table has UPSERT enabled and the schema has the primary key defined.
4. Flink sets the primary key in the SEND api.
5. We have one col where we store encrypted values (say for a public IP). Even when the same public IP comes in, the encryption algo can produce a different encrypted string. Say 72.163.12.13 could be 'AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw=' and the next time the same IP could be 'AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E='.
6. Now, when a duplicate packet comes, the realtime table handles it via UPSERT.
7. Point to note: upsert still works on the primary key col only and maintains all duplicate copies internally.
8. When we move this data to the offline table via minion, we have set "mergeType" to "dedup".
9. The "dedup" config works on the entire row, not on the primary key alone.
10. Since this encrypted column has different values (although they all point to the same IP), "dedup" fails and duplicates show up in the OFFLINE table.
Now, my question is: if the REALTIME table can work on the primary key for UPSERT, can we use the same primary key col to perform "mergeType": "dedup" for the offline table too, instead of comparing the entire row? Both rows are identical except for this encrypted value (although it points to the same public IP).
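To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is not Pinot's implementation: the function names and the `encrypted_ip` column name are hypothetical, and keeping the last row seen per key is an assumption chosen only to give the key-based variant a definite answer to "which row gets dropped".

```python
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, str]


def dedup_full_row(rows: Iterable[Row]) -> List[Row]:
    """Drop a row only when every column matches an earlier row."""
    seen = set()
    kept: List[Row] = []
    for row in rows:
        # The fingerprint covers all columns, so any differing value
        # (such as a re-encrypted IP) makes the row look unique.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept


def dedup_by_key(rows: Iterable[Row], key_columns: Sequence[str]) -> List[Row]:
    """Hypothetical key-based dedup: keep the last row seen per key."""
    latest: Dict[tuple, Row] = {}
    for row in rows:
        # Only the key columns decide whether two rows are duplicates.
        latest[tuple(row[c] for c in key_columns)] = row
    return list(latest.values())


# Two packets for the same ID whose encrypted column differs,
# using the example values from the message above.
rows = [
    {"ID": "1", "encrypted_ip": "AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw="},
    {"ID": "1", "encrypted_ip": "AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E="},
]

full_row_result = dedup_full_row(rows)
key_result = dedup_by_key(rows, ["ID"])
```

With these two rows, the full-row variant keeps both because the encrypted strings differ, while the key-based variant keeps a single row for ID 1.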

Chinmay Soman

10/19/2021, 9:24 PM
This makes sense. Can you open an issue? Something like: ability to specify the primary key during realtime-to-offline dedup.