# ask-community-for-troubleshooting
t
Hi team, I found a small issue in the data normalisation. We are working with a custom connector that uses the `full-refresh append` sync method. In the `_airbyte_raw` table (before normalisation) there is only one record for each id emitted per day. However, there are duplicates in the `raw_` table (after normalisation) based on the `_airbyte_normalized_at` field. For some reason, records that were emitted on two different days are being normalised on the same day. Could someone help us understand why this behaviour occurs and what we can do to fix it? Please see the attached screenshot of the duplication based on `_airbyte_normalized_at`. Data has been obscured for privacy purposes. Thank you so much in advance!
@Mark Suemegi
s
Hey, thanks for including this great explanation and screenshots. My understanding of `full refresh append` is that duplicates in the final tables are expected, as normalization appends whatever is in the raw tables onto the final tables. I've attached a graphic from this blog post that helps illustrate. I'm not sure why the `_airbyte_raw` tables don't have duplicates... unless they're getting cleaned up somehow? I would expect the historical data to still be there, but I could be wrong, so I'll try to find an answer for this. What's happening with the final tables does seem to be expected behaviour, though. There's a bit more on why records emitted on two different days are normalized on the same day here: https://docs.airbyte.com/understanding-airbyte/basic-normalization/#incremental-runs Either way, if you need better deduplication, incremental sync is the way to go if it's possible.
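To make the append behaviour described above concrete, here is a minimal toy model in Python. It is not Airbyte code: the `full_refresh_append` function, the `synced_on` field, and the two-record snapshot are all hypothetical, and it only assumes that each sync re-reads the whole source and appends it to the destination.

```python
from datetime import date


def full_refresh_append(destination, source_snapshot, sync_day):
    """Append every record of the current snapshot, tagged with the sync day.

    Hypothetical helper: nothing is deduplicated or overwritten.
    """
    for record in source_snapshot:
        destination.append({**record, "synced_on": sync_day})
    return destination


# Hypothetical source: one record per id, re-read in full on every sync.
snapshot = [{"id": 1}, {"id": 2}]

table = []
full_refresh_append(table, snapshot, date(2022, 10, 25))
full_refresh_append(table, snapshot, date(2022, 10, 26))

# After two syncs every id appears twice, once per sync day.
rows_for_id_1 = [row for row in table if row["id"] == 1]
```

Under this model, repeated ids across days are expected; what it does not explain is two rows for one id on the same day.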
m
@Lara Tanbari @Willi
Hi Sunny, sorry for the delayed reply, we have looked into your answer 🙂 When we talk about duplicates we mean "duplicates on the same day". Our data source delivers records every day (1 record per unique id), and these ids can of course repeat the next day. I double-checked that this uniqueness condition holds in our data source. Our problem is this: say I want to get the data for a specific day. If I filter on `date(_airbyte_normalized_at) = '2022-10-26'` in the example from @Tanasorn Chindasook above, we get two records for the same id (one of them emitted the day before). This surprised us, because we assumed that all records emitted on a given day are also normalized that day. In the example above, however, the record emitted on the 25th was only normalized on the 26th. (If I search our database for rows where the normalized_at and emitted_at values fall on different dates, I get very few rows compared to the size of our tables.) What we would like to understand is why normalization can happen one day late for some records when we use the full-refresh append logic, where we expect every record to be appended to our raw table every day without any other consideration. Thanks again for the support! 😊
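The check described above can be sketched in Python as follows. The rows and timestamps are invented for illustration, and the helper names `ids_per_day` and `late_normalized` are hypothetical; the only assumption about the real table is that each row carries `_airbyte_emitted_at` and `_airbyte_normalized_at` timestamps. It counts ids per day using either column and lists the rows whose two dates differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: id "a" was emitted on the 25th and again on the 26th,
# but both copies were normalized on the 26th.
rows = [
    {"id": "a",
     "_airbyte_emitted_at": datetime(2022, 10, 25, 23, 50),
     "_airbyte_normalized_at": datetime(2022, 10, 26, 0, 5)},
    {"id": "a",
     "_airbyte_emitted_at": datetime(2022, 10, 26, 23, 40),
     "_airbyte_normalized_at": datetime(2022, 10, 26, 23, 55)},
    {"id": "b",
     "_airbyte_emitted_at": datetime(2022, 10, 26, 23, 40),
     "_airbyte_normalized_at": datetime(2022, 10, 26, 23, 55)},
]


def ids_per_day(rows, column):
    """Count rows per (calendar day of `column`, id) pair."""
    return Counter((row[column].date(), row["id"]) for row in rows)


def late_normalized(rows):
    """Rows whose emitted and normalized timestamps fall on different dates."""
    return [
        row for row in rows
        if row["_airbyte_emitted_at"].date() != row["_airbyte_normalized_at"].date()
    ]


dupes_by_normalized = {
    key: n for key, n in ids_per_day(rows, "_airbyte_normalized_at").items() if n > 1
}
dupes_by_emitted = {
    key: n for key, n in ids_per_day(rows, "_airbyte_emitted_at").items() if n > 1
}
late_rows = late_normalized(rows)
```

In this toy data the same-day duplicate only shows up when grouping on `_airbyte_normalized_at`; grouping on `_airbyte_emitted_at` gives one row per id per day. Whether filtering on the emitted date is appropriate for the real tables depends on how the connector sets that field.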
m
What is the cursor field used for this custom source?