Hi @User
Sorry for the confusion, let me explain it step by step:
1. The source emits duplicate packets (say the ID column is the primary key).
2. We have a hybrid table.
3. The REALTIME table has UPSERT enabled and the schema has the primary key defined.
4. Flink is setting the primary key in the SEND API.
5. We have one column where we store encrypted values (say for the public IP). Even if the same public IP comes in, the encryption algorithm can produce a different encrypted string each time. Say 72.163.12.13 could be 'AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw=' one time and 'AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E=' the next.
6. Now, when a duplicate packet comes in, the REALTIME table handles it via UPSERT.
7. Point to note: UPSERT still works on the primary key column only and maintains all the duplicate copies internally.
8. Now, when we move this data to the OFFLINE table via minion, we have set "mergeType" to "dedup".
9. The "dedup" config compares the entire row, not the primary key alone, for deduplication.
10. Since we have this encrypted column with different values (although they all point to the same IP), "dedup" fails and we see duplicates in the OFFLINE table.
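For context, this is roughly what our setup looks like. A minimal sketch of the REALTIME table config (table name, topic, and bucket period are placeholders, and I've trimmed everything unrelated to upsert and the minion task):

```json
{
  "tableName": "events_REALTIME",
  "upsertConfig": {
    "mode": "FULL"
  },
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "mergeType": "dedup"
      }
    }
  }
}
```

The schema side has the key declared via `"primaryKeyColumns": ["id"]`, which is what UPSERT keys on; "dedup" in the task config above doesn't look at it, as far as I can tell.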
Now, my question is: if the REALTIME table can work on the primary key for UPSERT, can we use the same primary-key column for "mergeType": "dedup" in the OFFLINE table too, instead of comparing the entire row?
Both rows are identical; only the encrypted value differs (although it points to the same public IP).
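To make the mismatch concrete, here's a toy sketch (plain Python, not Pinot code; the column names are made up) of why row-level dedup keeps both copies while key-level dedup would collapse them:

```python
# Two "duplicate" rows: same primary key, but the encrypted IP column
# differs because the encryption is non-deterministic.
rows = [
    {"id": 1, "ip_enc": "AQEB08sEIVpQIwzf2OgM+dUuCwK/8pkPUiBuX8S/Kiw="},
    {"id": 1, "ip_enc": "AQEBJ43UueX2BiqQjMICexvrXiKJjXEn8h/LyHak06E="},
]

# What "dedup" does today: the whole row is the dedup key,
# so the two rows are considered distinct and both survive.
row_dedup = {tuple(sorted(r.items())) for r in rows}
print(len(row_dedup))  # 2 rows remain

# What we'd like: dedup on the primary key only,
# so the duplicates collapse to one row.
key_dedup = {r["id"]: r for r in rows}
print(len(key_dedup))  # 1 row remains
```

Hope that makes it clearer why UPSERT hides the duplicates on the REALTIME side but they reappear after the minion task runs.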