# ask-community-for-troubleshooting
m
What's this concept called? Our data stack has a lot of data being loaded in an EL fashion by Airbyte. This causes the source tables to contain a lot of duplicated information, as "incremental replication" is not supported for some sources. For example, a customer record from Intercom could be synced every day, regardless of whether the customer's data has been updated or not. To keep the source table from growing at a very steep rate, I have a script running in a dbt hook that deletes all duplicate records from the source table. It goes something like this:
create or replace table `project-name.airbyte_things._airbyte_raw_things` as (

  with base as (
    select
      *,
      -- hash the raw payload and number the rows that share the same hash,
      -- oldest emission first
      row_number() over(
        partition by sha512(_airbyte_data)
        order by _airbyte_emitted_at
      ) as hashed_data_rn
    from
      `project-name.airbyte_things._airbyte_raw_things`
  )

  -- keep only the first (earliest) row per hash and drop the helper column
  select * except(hashed_data_rn) from base where hashed_data_rn = 1
)
Basically it hashes the raw data in the source table, numbers the entries that share the same hash in order of arrival, and deletes every record except the first for each hash. What is this process of "condensing"/"cleaning" a source table called? 🙂
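For what it's worth, the same keep-one-row-per-hash filter can be written more compactly with BigQuery's qualify clause, so the CTE isn't strictly needed. A minimal sketch of the same logic against the same table (BigQuery requires a where, group by, or having clause alongside qualify, hence the where true):

create or replace table `project-name.airbyte_things._airbyte_raw_things` as (
  select *
  from `project-name.airbyte_things._airbyte_raw_things`
  where true  -- BigQuery requires WHERE/GROUP BY/HAVING next to QUALIFY
  qualify row_number() over(
    partition by sha512(_airbyte_data)
    order by _airbyte_emitted_at
  ) = 1
)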
u
cleaning or deduping the data 😃
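For anyone wanting to wire this up the way the question describes, the dedup statement can live in a dbt macro and be invoked from a hook. A minimal sketch, assuming BigQuery and Airbyte's raw-table layout; the macro name dedup_raw_table and the file path are made up for illustration:

-- macros/dedup_raw_table.sql (hypothetical macro name)
{% macro dedup_raw_table(table) %}
create or replace table {{ table }} as (
  select * except(hashed_data_rn)
  from (
    select
      *,
      -- number rows sharing the same payload hash, oldest first
      row_number() over(
        partition by sha512(_airbyte_data)
        order by _airbyte_emitted_at
      ) as hashed_data_rn
    from {{ table }}
  )
  where hashed_data_rn = 1
)
{% endmacro %}

It could then be called from a hook, e.g. in dbt_project.yml: on-run-start: "{{ dedup_raw_table('`project-name.airbyte_things._airbyte_raw_things`') }}".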