Taher Ajnawala — 07/18/2024, 4:59 AM

kapa.ai — 07/18/2024, 4:59 AM
Tables in the `airbyte_internal` namespace are used for staging purposes during the data synchronization process. As per the [discussion on GitHub](https://github.com/airbytehq/airbyte/discussions/39001), it's generally recommended to partition the destination table by ingestion time, since this affects only the current partition and minimizes the partitions that need to be scanned when merging data from the temporary table in `airbyte_internal` into the destination table.

However, it's important to note that Airbyte destination tables are typically optimized for write performance rather than read performance. It's therefore often recommended to use a modeling tool like dbt or Dataform to materialize a read-optimized version of the data, partitioned and clustered according to the needs of your use case.

As for partitioning the internal tables of the Airbyte Postgres destination, this isn't explicitly addressed in the provided sources. However, since Postgres is not recommended as a destination for large data volumes (over 10GB) due to performance issues, as stated in the [Airbyte documentation](https://docs.airbyte.com/integrations/destinations/postgres/postgres-troubleshooting#postgres-is-not-a-data-warehouse), it may be more beneficial to use a data warehouse like BigQuery, Snowflake, or Redshift for large data volumes. Before making any changes, it's crucial to understand the implications and potential costs of partitioning, and it's always a good idea to consult your database administrator or a data engineer.

Taher Ajnawala
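As an illustrative sketch of the "materialize a read-optimized copy" advice above (not from the thread itself), here is a small helper that builds the BigQuery DDL for a table partitioned on `_airbyte_extracted_at` and clustered for a downstream query pattern. The project, dataset, and clustering column names are hypothetical; in practice a dbt model with `partition_by`/`cluster_by` config would play this role.

```python
def read_optimized_ddl(source_table: str, target_table: str,
                       partition_col: str, cluster_cols: list[str]) -> str:
    """Build a BigQuery CREATE TABLE statement that materializes a
    read-optimized copy of an Airbyte destination table, partitioned
    and clustered for downstream queries (names are hypothetical)."""
    return (
        f"CREATE OR REPLACE TABLE `{target_table}`\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {', '.join(cluster_cols)}\n"
        f"AS SELECT * FROM `{source_table}`"
    )

ddl = read_optimized_ddl(
    source_table="my_project.airbyte_raw.orders",  # hypothetical
    target_table="my_project.analytics.orders",    # hypothetical
    partition_col="_airbyte_extracted_at",
    cluster_cols=["customer_id", "status"],        # hypothetical
)
print(ddl)
```

Partitioning the read-optimized table on a date derived from `_airbyte_extracted_at` keeps downstream scans cheap, while the write-optimized Airbyte tables stay untouched.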
07/26/2024, 7:12 AM

kapa.ai — 07/26/2024, 7:13 AM
The choice between `_airbyte_extracted_at` and `_airbyte_loaded_at` depends on the context of your data processing needs.

- `_airbyte_extracted_at`: a timestamp indicating when the event was pulled from the data source. It is used in various destinations such as [Redshift](https://docs.airbyte.com/integrations/destinations/redshift#output-schema), [Postgres](https://docs.airbyte.com/integrations/destinations/postgres#output-schema-raw-tables), [BigQuery](https://docs.airbyte.com/integrations/destinations/bigquery#output-schema), [Yellowbrick](https://docs.airbyte.com/integrations/destinations/yellowbrick#output-schema-raw-tables), and [Snowflake](https://docs.airbyte.com/integrations/destinations/snowflake#output-schema). Use it when you want to know when the data was actually extracted from the source.
- `_airbyte_loaded_at`: a timestamp indicating when the row was processed into the final table. It appears in the same destinations ([Redshift](https://docs.airbyte.com/integrations/destinations/redshift#output-schema), [Postgres](https://docs.airbyte.com/integrations/destinations/postgres#output-schema-raw-tables), [BigQuery](https://docs.airbyte.com/integrations/destinations/bigquery#output-schema), [Yellowbrick](https://docs.airbyte.com/integrations/destinations/yellowbrick#output-schema-raw-tables), and [Snowflake](https://docs.airbyte.com/integrations/destinations/snowflake#output-schema)). Use it when you want to know when the data was loaded into the destination.

In the context of data quality monitoring, for example, you might use `_airbyte_extracted_at` as the date column that re_data uses to calculate metrics, since it most accurately reflects when the data was generated ([source](https://airbyte.com/tutorials/identify-data-quality-issues-on-data-ingestion-pipelines#step-2)). When upgrading to Destinations V2, Airbyte recommends using `_airbyte_extracted_at` in place of the no-longer-available `_airbyte_normalized_at` ([source](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2#updating-downstream-transformations)).

So the choice between `_airbyte_extracted_at` and `_airbyte_loaded_at` comes down to whether you care more about when the data was extracted from the source or when it was loaded into the destination.
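To make the distinction concrete, here is a small sketch (not from the thread; it assumes rows have already been fetched into Python dicts with the standard Airbyte metadata columns) that uses both timestamps: the load lag is the gap between them, while freshness is best measured from `_airbyte_extracted_at`, since that reflects when the data was generated.

```python
from datetime import datetime, timezone
from typing import Optional

def load_lag_seconds(row: dict) -> float:
    """Seconds between extraction from the source and landing in the
    final table, using Airbyte's metadata columns."""
    return (row["_airbyte_loaded_at"] - row["_airbyte_extracted_at"]).total_seconds()

def freshness_seconds(rows: list[dict], now: Optional[datetime] = None) -> float:
    """Age of the newest record, measured from _airbyte_extracted_at,
    the column that best reflects when the data was generated."""
    now = now or datetime.now(timezone.utc)
    newest = max(r["_airbyte_extracted_at"] for r in rows)
    return (now - newest).total_seconds()

row = {
    "_airbyte_extracted_at": datetime(2024, 7, 26, 7, 0, tzinfo=timezone.utc),
    "_airbyte_loaded_at": datetime(2024, 7, 26, 7, 5, tzinfo=timezone.utc),
}
print(load_lag_seconds(row))  # 300.0
```

A persistently growing load lag points at a problem in the destination's typing/deduping step, whereas stale `_airbyte_extracted_at` values point at the source or the sync schedule.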