Tanasorn Chindasook
10/31/2022, 10:54 AM
full-refresh append sync method. The issue that we are currently seeing is that in the _airbyte_raw (before normalisation) table there is only one record for each id emitted per day. However, there are duplicates on the raw_ (after normalisation) table based on the _airbyte_normalized_at field. For some reason, records that are emitted on two different days are being normalised on the same day. Could someone help provide us with some insight to why this behaviour is occurring and what we can do to fix it?
Please see attached screenshot of the duplication based on _airbyte_normalized_at. Data has been obscured for privacy purposes. Thank you so much in advance!
Tanasorn Chindasook
10/31/2022, 4:08 PM
Sunny Hashmi (Airbyte)
11/01/2022, 2:43 PM
full refresh append is that it's expected to have duplicates in the final tables, as normalization will append whatever is in raw tables onto the final tables. I've attached a graphic from this blog post that helps illustrate.
I'm not sure why the _airbyte_raw tables don't have duplicates... unless they're getting cleaned up somehow? I would expect the historical data to still be there, but I could be wrong -- I'll try to find an answer for this. But, what's happening with the final tables does seem to be expected behaviour.
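A toy sketch of the append behaviour described above may help. It is not Airbyte's code: the function `full_refresh_append_sync`, the in-memory `final_table` list, and the sample ids are all made up for illustration, and the raw-table behaviour that is still in question here is deliberately not modelled.

```python
from collections import Counter


def full_refresh_append_sync(final_table, source_rows, emitted_at):
    """Hypothetical model: re-read every source row and append it.

    Nothing is deleted or deduplicated, so each run adds one more
    copy of every id that still exists in the source.
    """
    for row in source_rows:
        final_table.append({**row, "_airbyte_emitted_at": emitted_at})


# Made-up source data: two records that do not change between runs.
source = [{"id": 1}, {"id": 2}]
final_table = []

# Two daily runs against the same source.
full_refresh_append_sync(final_table, source, "2022-10-25")
full_refresh_append_sync(final_table, source, "2022-10-26")

# Count how many copies of each id ended up in the final table.
copies_per_id = Counter(row["id"] for row in final_table)
print(copies_per_id)
```

In this model every id appears once per run, which is the kind of duplication on id that an append-only mode produces by design.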
There's a bit more on why records emitted on two different days are normalized on the same day here: https://docs.airbyte.com/understanding-airbyte/basic-normalization/#incremental-runs
Either way, if you need better deduplication, incremental sync is the way to go if it's possible.
Mark Suemegi
11/02/2022, 8:54 AM
Mark Suemegi
11/14/2022, 10:10 AM
date(_airbyte_normalized_at) = '2022-10-26' in the above-mentioned example by @Tanasorn Chindasook we will get two records for the same id (one emitted the day before). This was surprising to us because we would have thought that all records emitted on a given day are also normalized on that day. However, in the example at the top, the record that was emitted on the 25th was only normalized on the 26th. (If I search our database for rows where the normalized_at and emitted_at values fall on different dates, I get very few rows compared to the size of our tables.)
What we would like to understand is why this can happen, i.e. why some records are normalized one day later, when we use the full-refresh append logic, where we expect every record to be appended to our raw table every day without any other consideration.
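The search mentioned above (rows whose emitted and normalized dates differ) could look roughly like the sketch below. It runs against an in-memory SQLite database so that it is self-contained; the table name `raw_example`, the helper `find_late_normalized`, and all sample rows and timestamps are hypothetical, and the `date()` syntax may need adjusting for the actual destination warehouse.

```python
import sqlite3


def find_late_normalized(conn, table):
    """Return rows whose emitted and normalized calendar dates differ.

    `table` is interpolated directly, so it must be a trusted name.
    """
    query = f"""
        SELECT id, _airbyte_emitted_at, _airbyte_normalized_at
        FROM {table}
        WHERE date(_airbyte_emitted_at) <> date(_airbyte_normalized_at)
        ORDER BY id, _airbyte_emitted_at
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_example ("
    "id INTEGER, _airbyte_emitted_at TEXT, _airbyte_normalized_at TEXT)"
)
# Made-up rows: id 1 has one copy normalized a day after it was emitted.
conn.executemany(
    "INSERT INTO raw_example VALUES (?, ?, ?)",
    [
        (1, "2022-10-25 09:00:00", "2022-10-26 09:05:00"),
        (1, "2022-10-26 09:00:00", "2022-10-26 09:05:00"),
        (2, "2022-10-26 09:00:00", "2022-10-26 09:05:00"),
    ],
)

late_rows = find_late_normalized(conn, "raw_example")
print(late_rows)
```

Only the first made-up row is returned, since it is the only one whose two timestamps fall on different dates. The query isolates the affected rows but says nothing about why they were normalized late.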
Thanks again for the support! 😊
Marcos Marx (Airbyte)
12/06/2022, 8:00 PM