Hello everyone if I use incremental mode deduped+history for Airbyte #ask-ai

Hello everyone, if I use incremental mode deduped...

Yusuf Mirkar

05/17/2023, 9:58 AM

Hello everyone, if I use incremental mode deduped+history for postgres<>postgres replication and my source db is currently 100 gb, will I have to maintain destination volume size as 200 gb Reason- I read that in this mode, one intermediate table is maintained on destination will have all the update records, so for each table it will be one intermediate & one final table of almost same sizes Is this true ?

kapa.ai

05/17/2023, 9:58 AM

A message from kapa.ai

kapa.ai

05/17/2023, 9:59 AM

Yes, when using the incremental deduped+history mode for Postgres-to-Postgres replication, Airbyte creates three tables in the destination: a raw table, a history (SCD) table, and a deduplicated table [source](https://airbyte.com/tutorials/incremental-data-synchronization). The raw table appends all records (new and old ones), while the history (SCD) table stores the history of all records (new, old, and deleted ones too) [source](https://airbytehq.slack.com/archives/C021JANJ6TY/p1671120845649769). The deduplicated table contains the final deduplicated data. Considering this, it's possible that the destination volume size may need to be larger than the source volume size. However, it's difficult to estimate the exact size without knowing the specifics of your data and how often it's updated. It's a good idea to monitor the destination volume size and adjust it as needed based on your data growth and replication requirements.

Yusuf Mirkar

05/17/2023, 10:03 AM

@kapa.ai I did not understood the second referenced link.

kapa.ai

05/17/2023, 10:03 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:04 AM

I'm not sure which second referenced link you're referring to. Could you please clarify which link or topic you need help with? I'll be happy to assist you with the information available in the provided extracts.

Yusuf Mirkar

05/17/2023, 10:04 AM

How 7.5m rows became 2.5m in raw table ? Refer this link @kapa.ai

kapa.ai

05/17/2023, 10:04 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:06 AM

In the [conversation you mentioned](https://airbytehq.slack.com/archives/C021JANJ6TY/p1671120845649769), a user was trying to understand why their table with 7.5 million rows in the source ended up with different row counts in the destination tables. The user had set up a connection with the

increment|deduped history

replication mode and had normalization enabled. The destination tables had the following row counts: - raw_table: 2.5 million rows - table: 2.1 million rows - table_scd: 2.6 million rows The user expected the

scd

and

raw

tables to have the same number of rows. However, the support team explained that: -

raw

tables append all records (new ones and old ones) -

table_scd

stores the history of all records (new ones, old, and deleted ones too) -

table

is the final table with deduplicated records The difference in row counts between the

raw

and

scd

tables could be due to the fact that the

scd

table also stores deleted records, while the

raw

table only appends new and old records. The main table has deduplicated records, which is why it has a lower row count. It's important to note that the row counts in the destination tables can vary depending on the replication mode, normalization settings, and the nature of the data being synced.

Yusuf Mirkar

05/17/2023, 10:08 AM

@kapa.ai how table_scd know about deleted rows. As per the airbyte documentation, incremental mode cannot detect deleted rows

kapa.ai

05/17/2023, 10:08 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:09 AM

You are correct that in the standard incremental mode, Airbyte cannot detect deleted rows. However, in the conversation you mentioned earlier, the user was using the

increment|deduped history

replication mode, which still doesn't capture deleted rows. If you need to capture deleted rows, you should consider using [Change Data Capture (CDC)](https://airbyte.com/tutorials/incremental-change-data-capture-cdc-replication) replication. CDC replication captures modifications on the source database, including deletions, and transmits them to the destination database. This is different from standard incremental replication, which cannot return deleted records and does not correctly transmit deletions from the source to the destination. For more information on CDC replication, refer to this [tutorial](https://airbyte.com/tutorials/incremental-change-data-capture-cdc-replication).

Yusuf Mirkar

05/17/2023, 10:11 AM

@kapa.ai agreed but as you confirmed user was using

incremental | deduped history

and not

cdc

so how did he got deleted rows in table_scd ?

kapa.ai

05/17/2023, 10:11 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:12 AM

New message text here

Yusuf Mirkar

05/17/2023, 10:15 AM

@kapa.ai Ok so back to the original question, if i use

incremental | deduped history

there will be two tables in destination - table_scd and final table, right ? table_scd will have all new and old records but not deleted ones as it is not

cdc

and final table will have only deduped records, right ?

kapa.ai

05/17/2023, 10:15 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:16 AM

New message text here

Yusuf Mirkar

05/17/2023, 10:18 AM

@kapa.ai this approach is taking lot of space, almost double. If source table is 100gb table_scd will be around same or more than that and final table will be 100gb so destination volume is unnecessarily increased to 200gb or more than that .

kapa.ai

05/17/2023, 10:18 AM

A message from kapa.ai

kapa.ai

05/17/2023, 10:18 AM

@Yusuf Mirkar That's a great question, but unfortunately I am only capable of responding to a maximum of five questions per thread. Don't hesitate to create a new thread for any further questions you have though. I'm here to help!

Open in Slack

Previous Next