Andreas Nigg

08/02/2022, 7:36 AM
Hey there, I have a general question about dbt data transformations, not really related to Airbyte. We use Airbyte to ingest data into our warehouse (BigQuery). From there, we use dbt to transform it as we need. One of our data imports is rather huge (let's say 100 GB in total to make it easy). We use Airbyte to ingest an additional 1 GB every day. This daily ingest also creates a lot of duplicates (the 100 GB table already contains some of the rows that arrive with the daily insert). This is due to the underlying data structure, and there's not much we can do about it. How would you go about deduplicating this data? I would like to avoid reading all 100 GB every day just for deduplication. Any ideas for that? Thanks in advance 😄 🚀