Lucas Segers
05/29/2024, 6:39 PMairbyte_internal
.`STREAM_NAME``
are there any recommendations to make the costs lower?kapa.ai
05/29/2024, 6:39 PMairbyte_internal
.`STREAM_NAME`` can be optimized to avoid full table scans. One approach is to assign the result of the MAX
function to a variable. This can help the BigQuery SQL optimizer to avoid full table scans. Here is an example of how you might modify the query: sql DECLARE max_extracted_at TIMESTAMP; SET max_extracted_at = (SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal`.`STREAM_NAME`); SELECT * FROM `airbyte_internal`.`STREAM_NAME` WHERE _airbyte_extracted_at >= max_extracted_at;
This approach is mentioned in the forum post [BigQuery SQL optimization](https://discuss.airbyte.io/t/1515). 2. Use Partitioned Tables: Partitioning your tables can significantly reduce the amount of data scanned by queries. By partitioning the table on the _airbyte_extracted_at
column, you can limit the query to only scan the relevant partitions. This is particularly useful for incremental data loads. Example of creating a partitioned table: sql CREATE OR REPLACE TABLE `your_dataset.your_table` PARTITION BY DATE(_airbyte_extracted_at) AS SELECT * FROM `your_dataset.your_source_table`;
This recommendation is aligned with the forum post [Destination BigQuery - Deduped + history generates too much processing costs](https://discuss.airbyte.io/t/2445). 3. Clustering: Clustering your tables can also help improve query performance by organizing the data in a way that makes it more efficient to scan. For example, clustering on _airbyte_extracted_at
and other relevant columns can help reduce the amount of data processed. Example of creating a clustered table: sql CREATE OR REPLACE TABLE `your_dataset.your_table` CLUSTER BY _airbyte_extracted_at, _airbyte_unique_key_scd AS SELECT * FROM `your_dataset.your_source_table`;
This approach is discussed in the [Github Issue on 1s1t Performance Testing](https://github.com/airbytehq/airbyte/issues/27571). By implementing these optimizations, you can reduce the costs associated with loading and querying data in BigQuery. If you need further assistance, you can refer to the [BigQuery documentation on partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables) and [clustering](https://cloud.google.com/bigquery/docs/clustered-tables) for more details.Lucas Segers
05/29/2024, 6:44 PMkapa.ai
05/29/2024, 6:44 PMmax(_airbyte_emitted_at)
to a variable to avoid a full table scan, as mentioned in the forum post. 3. Replace the SQL Files: After editing the SQL query, you need to put the modified file back into the Docker image. Here is a step-by-step guide based on the information provided: ### Step 1: Export the SQL Files You can use Docker commands to copy the SQL files from the Docker container to your local machine. For example: sh docker cp <container_id>:/path/to/sql/file.sql /local/path/to/sql/file.sql
### Step 2: Edit the SQL Query Open the exported SQL file in your preferred text editor and make the necessary changes. For example, you can modify the part of the query that selects max(_airbyte_emitted_at)
to use a variable instead: sql DECLARE max_emitted_at TIMESTAMP; SET max_emitted_at = (SELECT MAX(CAST(_airbyte_emitted_at AS TIMESTAMP)) FROM `marketplace-reporting`.airbyte.`lazada_raw_orders_scd`); CREATE OR REPLACE TABLE `marketplace-reporting`.airbyte.`lazada_raw_orders_scd__dbt_tmp` PARTITION BY RANGE_BUCKET( _airbyte_active_row, GENERATE_ARRAY(0, 1, 1) ) CLUSTER BY _airbyte_unique_key_scd, _airbyte_emitted_at OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR) ) AS ( WITH new_data AS ( SELECT * FROM `marketplace-reporting`._airbyte_airbyte.`lazada_raw_orders_stg` WHERE 1 = 1 AND COALESCE( CAST(_airbyte_emitted_at AS TIMESTAMP) >= max_emitted_at, TRUE ) ) -- Additional query logic here );
### Step 3: Replace the SQL Files After editing the SQL file, copy it back to the Docker container: sh docker cp /local/path/to/sql/file.sql <container_id>:/path/to/sql/file.sql
### Additional Resources For more detailed instructions on customizing SQL queries in Airbyte, you can refer to the [Airbyte documentation on transformations with SQL](https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-sql#customize-sql-query). If you encounter any issues or need further assistance, you can also check the [Airbyte community forum](https://discuss.airbyte.io) for additional support and examples. By following these steps, you should be able to customize the auto-generated SQL query to better suit your needs and optimize performance.