As part of handling duplicates in our hybrid table, we thought of using "mergeType": "dedup" for moving data from realtime to offline table.
The problem we are facing is that one of our columns stores an encrypted value, and even for duplicate rows this value changes every time.
Is there a way to perform "dedup" on a subset of columns for moving data to offline table via minion?
10/18/2021, 2:52 PM
Won’t that cause data loss due to incorrect dedup?
10/18/2021, 3:26 PM
Hi @User, by a subset of columns I mean only the primary key columns. Currently, the "mergeType": "dedup" config compares the entire row. Is there any option to restrict it to the primary key columns?
10/18/2021, 4:35 PM
There isn't one right now, afaik. But I am still unclear. Let's say you have two rows with same primary key values, but different on other dimensions, which ones do you expect the dedup to drop?
10/19/2021, 4:36 AM
Sorry for the confusion, let me explain it step by step:
1. Source emits duplicate packets (say ID col is the primary key)
2. We have a hybrid table.
3. Realtime table has UPSERT enabled and schema has the primary key defined.
4. Flink is setting the primary key in the SEND api
5. We have one col where we store encrypted values (say, for the public IP). Even if the same public IP comes in, the encryption algo can produce a different encrypted string. Say 188.8.131.52 could have 'AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw=' and the next time the same IP could have 'AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E='
6. Now, when a duplicate packet comes, the realtime table handles it via UPSERT.
7. Point to note: Upsert works on the primary key col only and still maintains all duplicate copies internally.
8. Now, when we move this data to the offline table via minion, we have set "mergeType" to "dedup"
9. The "dedup" config works on the entire row, not on the primary key alone.
10. Since we have this encrypted column with different values (although they all point to the same IP), "dedup" is failing and duplicates show up in the OFFLINE table.
Now, my question was, if REALTIME table can work on primary key for UPSERT, can we use the same primary key col to perform "mergeType": "dedup" in offline table too, instead of comparing the entire row?
The two rows are otherwise identical; only this encrypted value differs (although it points to the same public IP).
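To illustrate the difference, here is a minimal Python sketch (not Pinot code). The function names, the `encrypted_ip` and `bytes` columns, and the "keep the last row seen" rule are all assumptions made up for the example; only the `ID` key and the two encrypted strings come from the steps above.

```python
from typing import Dict, Hashable, List, Sequence

Row = Dict[str, Hashable]


def dedup_full_row(rows: Sequence[Row]) -> List[Row]:
    """Drop a row only if every column matches an earlier row (full-row comparison)."""
    seen = set()
    result = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # all columns take part
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


def dedup_by_key(rows: Sequence[Row], key_columns: Sequence[str]) -> List[Row]:
    """Keep one row per primary key; here the last row seen wins (an assumption)."""
    latest: Dict[tuple, Row] = {}
    for row in rows:
        latest[tuple(row[col] for col in key_columns)] = row
    return list(latest.values())


# Two copies of the same packet: same ID, different ciphertext for the same IP.
rows = [
    {"ID": 1, "encrypted_ip": "AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw=", "bytes": 100},
    {"ID": 1, "encrypted_ip": "AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E=", "bytes": 100},
]

full = dedup_full_row(rows)           # both rows survive: the ciphertexts differ
keyed = dedup_by_key(rows, ["ID"])    # one row survives: only ID is compared
print(len(full), len(keyed))
```

The full-row version is what we see today in the OFFLINE table; the key-based version is the behaviour we are asking for.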
10/19/2021, 9:24 PM
This makes sense - can you open an issue? Something like: ability to specify the primary key during realtime-to-offline dedup.