Hello everyone, if I use incremental mode deduped...
# ask-ai
y
Hello everyone, if I use incremental mode deduped+history for postgres<>postgres replication and my source db is currently 100 gb, will I have to maintain destination volume size as 200 gb Reason- I read that in this mode, one intermediate table is maintained on destination will have all the update records, so for each table it will be one intermediate & one final table of almost same sizes Is this true ?
k
A message from kapa.ai
Yes, when using the incremental deduped+history mode for Postgres-to-Postgres replication, Airbyte creates three tables in the destination: a raw table, a history (SCD) table, and a deduplicated table [source](https://airbyte.com/tutorials/incremental-data-synchronization). The raw table appends all records (new and old ones), while the history (SCD) table stores the history of all records (new, old, and deleted ones too) [source](https://airbytehq.slack.com/archives/C021JANJ6TY/p1671120845649769). The deduplicated table contains the final deduplicated data. Considering this, it's possible that the destination volume size may need to be larger than the source volume size. However, it's difficult to estimate the exact size without knowing the specifics of your data and how often it's updated. It's a good idea to monitor the destination volume size and adjust it as needed based on your data growth and replication requirements.
y
@kapa.ai I did not understood the second referenced link.
k
A message from kapa.ai
I'm not sure which second referenced link you're referring to. Could you please clarify which link or topic you need help with? I'll be happy to assist you with the information available in the provided extracts.
y
How 7.5m rows became 2.5m in raw table ? Refer this link @kapa.ai
k
A message from kapa.ai
In the [conversation you mentioned](https://airbytehq.slack.com/archives/C021JANJ6TY/p1671120845649769), a user was trying to understand why their table with 7.5 million rows in the source ended up with different row counts in the destination tables. The user had set up a connection with the
increment|deduped history
replication mode and had normalization enabled. The destination tables had the following row counts: - raw_table: 2.5 million rows - table: 2.1 million rows - table_scd: 2.6 million rows The user expected the
scd
and
raw
tables to have the same number of rows. However, the support team explained that: -
raw
tables append all records (new ones and old ones) -
table_scd
stores the history of all records (new ones, old, and deleted ones too) -
table
is the final table with deduplicated records The difference in row counts between the
raw
and
scd
tables could be due to the fact that the
scd
table also stores deleted records, while the
raw
table only appends new and old records. The main table has deduplicated records, which is why it has a lower row count. It's important to note that the row counts in the destination tables can vary depending on the replication mode, normalization settings, and the nature of the data being synced.
y
@kapa.ai how table_scd know about deleted rows. As per the airbyte documentation, incremental mode cannot detect deleted rows
k
A message from kapa.ai
You are correct that in the standard incremental mode, Airbyte cannot detect deleted rows. However, in the conversation you mentioned earlier, the user was using the
increment|deduped history
replication mode, which still doesn't capture deleted rows. If you need to capture deleted rows, you should consider using [Change Data Capture (CDC)](https://airbyte.com/tutorials/incremental-change-data-capture-cdc-replication) replication. CDC replication captures modifications on the source database, including deletions, and transmits them to the destination database. This is different from standard incremental replication, which cannot return deleted records and does not correctly transmit deletions from the source to the destination. For more information on CDC replication, refer to this [tutorial](https://airbyte.com/tutorials/incremental-change-data-capture-cdc-replication).
y
@kapa.ai agreed but as you confirmed user was using
incremental | deduped history
and not
cdc
so how did he got deleted rows in table_scd ?
k
A message from kapa.ai
New message text here
y
@kapa.ai Ok so back to the original question, if i use
incremental | deduped history
there will be two tables in destination - table_scd and final table, right ? table_scd will have all new and old records but not deleted ones as it is not
cdc
and final table will have only deduped records, right ?
k
A message from kapa.ai
New message text here
y
@kapa.ai this approach is taking lot of space, almost double. If source table is 100gb table_scd will be around same or more than that and final table will be 100gb so destination volume is unnecessarily increased to 200gb or more than that .
k
A message from kapa.ai
@Yusuf Mirkar That's a great question, but unfortunately I am only capable of responding to a maximum of five questions per thread. Don't hesitate to create a new thread for any further questions you have though. I'm here to help!