Taher Ajnawala — 07/18/2024, 4:59 AM

kapa.ai — 07/18/2024, 4:59 AM
Tables in the `airbyte_internal` namespace are used for staging purposes during the data synchronization process. As per the [discussion on GitHub](https://github.com/airbytehq/airbyte/discussions/39001), it's generally recommended to partition the destination table by ingestion time, since this affects only the current partition and minimizes the partitions that need to be scanned when merging data from the temporary table in `airbyte_internal` into the destination table.

However, it's important to note that Airbyte destination tables are typically optimized for write performance rather than read performance. It's therefore often recommended to use a modeling tool like dbt or Dataform to materialize a read-optimized version of the data, partitioned and clustered according to the needs of your use case.

As for partitioning the internal tables of the Airbyte Postgres destination, this isn't explicitly addressed in the provided sources. However, since Postgres is not recommended as a destination for large data volumes (over 10GB) due to performance issues, as stated in the [Airbyte documentation](https://docs.airbyte.com/integrations/destinations/postgres/postgres-troubleshooting#postgres-is-not-a-data-warehouse), it may be more beneficial to use a data warehouse like BigQuery, Snowflake, or Redshift for large data volumes. Before making any changes, it's crucial to understand the implications and potential costs of partitioning, and it's always a good idea to consult your database administrator or a data engineer.

Taher Ajnawala
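As an illustrative sketch of the "materialize a read-optimized copy" advice above (not from the thread itself), here is a small helper that builds the BigQuery DDL for a table partitioned on `_airbyte_extracted_at` and clustered for a downstream query pattern. The project, dataset, and clustering column names are hypothetical; in practice a dbt model with `partition_by`/`cluster_by` config would play this role.

```python
def read_optimized_ddl(source_table: str, target_table: str,
                       partition_col: str, cluster_cols: list[str]) -> str:
    """Build a BigQuery CREATE TABLE statement that materializes a
    read-optimized copy of an Airbyte destination table, partitioned
    and clustered for downstream queries (names are hypothetical)."""
    return (
        f"CREATE OR REPLACE TABLE `{target_table}`\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {', '.join(cluster_cols)}\n"
        f"AS SELECT * FROM `{source_table}`"
    )

ddl = read_optimized_ddl(
    source_table="my_project.airbyte_raw.orders",  # hypothetical
    target_table="my_project.analytics.orders",    # hypothetical
    partition_col="_airbyte_extracted_at",
    cluster_cols=["customer_id", "status"],        # hypothetical
)
print(ddl)
```

Partitioning the read-optimized table on a date derived from `_airbyte_extracted_at` keeps downstream scans cheap, while the write-optimized Airbyte tables stay untouched.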
07/26/2024, 7:12 AM

kapa.ai — 07/26/2024, 7:13 AM
The choice between `_airbyte_extracted_at` and `_airbyte_loaded_at` depends on the context of your data processing needs.

- `_airbyte_extracted_at`: a timestamp indicating when the event was pulled from the data source. It is used in various destinations such as [Redshift](https://docs.airbyte.com/integrations/destinations/redshift#output-schema), [Postgres](https://docs.airbyte.com/integrations/destinations/postgres#output-schema-raw-tables), [BigQuery](https://docs.airbyte.com/integrations/destinations/bigquery#output-schema), [Yellowbrick](https://docs.airbyte.com/integrations/destinations/yellowbrick#output-schema-raw-tables), and [Snowflake](https://docs.airbyte.com/integrations/destinations/snowflake#output-schema). Use it when you want to know when the data was actually extracted from the source.
- `_airbyte_loaded_at`: a timestamp indicating when the row was processed into the final table. It appears in the same destinations ([Redshift](https://docs.airbyte.com/integrations/destinations/redshift#output-schema), [Postgres](https://docs.airbyte.com/integrations/destinations/postgres#output-schema-raw-tables), [BigQuery](https://docs.airbyte.com/integrations/destinations/bigquery#output-schema), [Yellowbrick](https://docs.airbyte.com/integrations/destinations/yellowbrick#output-schema-raw-tables), and [Snowflake](https://docs.airbyte.com/integrations/destinations/snowflake#output-schema)). Use it when you want to know when the data was loaded into the destination.

In the context of data quality monitoring, for example, you might use `_airbyte_extracted_at` as the date column that re_data uses to calculate metrics, since it most accurately reflects when the data was generated ([source](https://airbyte.com/tutorials/identify-data-quality-issues-on-data-ingestion-pipelines#step-2)). When upgrading to Destinations V2, Airbyte recommends using `_airbyte_extracted_at` in place of the no-longer-available `_airbyte_normalized_at` ([source](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2#updating-downstream-transformations)).

So the choice between `_airbyte_extracted_at` and `_airbyte_loaded_at` comes down to whether you care more about when the data was extracted from the source or when it was loaded into the destination.
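To make the distinction concrete, here is a small sketch (not from the thread; it assumes rows have already been fetched into Python dicts with the standard Airbyte metadata columns) that uses both timestamps: the load lag is the gap between them, while freshness is best measured from `_airbyte_extracted_at`, since that reflects when the data was generated.

```python
from datetime import datetime, timezone
from typing import Optional

def load_lag_seconds(row: dict) -> float:
    """Seconds between extraction from the source and landing in the
    final table, using Airbyte's metadata columns."""
    return (row["_airbyte_loaded_at"] - row["_airbyte_extracted_at"]).total_seconds()

def freshness_seconds(rows: list[dict], now: Optional[datetime] = None) -> float:
    """Age of the newest record, measured from _airbyte_extracted_at,
    the column that best reflects when the data was generated."""
    now = now or datetime.now(timezone.utc)
    newest = max(r["_airbyte_extracted_at"] for r in rows)
    return (now - newest).total_seconds()

row = {
    "_airbyte_extracted_at": datetime(2024, 7, 26, 7, 0, tzinfo=timezone.utc),
    "_airbyte_loaded_at": datetime(2024, 7, 26, 7, 5, tzinfo=timezone.utc),
}
print(load_lag_seconds(row))  # 300.0
```

A persistently growing load lag points at a problem in the destination's typing/deduping step, whereas stale `_airbyte_extracted_at` values point at the source or the sync schedule.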