# ask-ai
l
I have a BigQuery destination that constantly loads lots of records into a table. I can see that it runs a query that scans the entire table: ``SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal`.`STREAM_NAME` ``. Are there any recommendations to make the costs lower?
k
To address the high costs associated with the query that scans the entire table in BigQuery, you can consider the following recommendations based on the provided knowledge sources:

1. Optimize the SQL Query: The query ``SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal`.`STREAM_NAME` `` can be optimized to avoid full table scans. One approach is to assign the result of the `MAX` function to a variable, which can help the BigQuery SQL optimizer avoid a full table scan. Here is an example of how you might modify the query:
```sql
DECLARE max_extracted_at TIMESTAMP;
SET max_extracted_at = (SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal`.`STREAM_NAME`);

SELECT * FROM `airbyte_internal`.`STREAM_NAME`
WHERE _airbyte_extracted_at >= max_extracted_at;
```
This approach is mentioned in the forum post [BigQuery SQL optimization](https://discuss.airbyte.io/t/1515).

2. Use Partitioned Tables: Partitioning your tables can significantly reduce the amount of data scanned by queries. By partitioning the table on the `_airbyte_extracted_at` column, you can limit the query to only scan the relevant partitions. This is particularly useful for incremental data loads. Example of creating a partitioned table:
```sql
CREATE OR REPLACE TABLE `your_dataset.your_table`
PARTITION BY DATE(_airbyte_extracted_at)
AS
SELECT * FROM `your_dataset.your_source_table`;
```
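Note that a lookup like `SELECT MAX(_airbyte_extracted_at)` only benefits from partitioning if it carries a filter on the partitioning column. A minimal sketch, assuming new records always arrive within the last 7 days (that window is an assumption about your sync cadence, not something Airbyte guarantees; widen it if syncs can lag further behind):

```sql
-- Only the last ~7 daily partitions are scanned instead of the whole table;
-- the 7-day window is an assumption about how far behind a sync can fall.
SELECT MAX(_airbyte_extracted_at)
FROM `your_dataset.your_table`
WHERE DATE(_airbyte_extracted_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```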
Partitioning is also what the forum post [Destination BigQuery - Deduped + history generates too much processing costs](https://discuss.airbyte.io/t/2445) recommends.

3. Clustering: Clustering your tables can also help improve query performance by organizing the data in a way that makes it more efficient to scan. For example, clustering on `_airbyte_extracted_at` and other relevant columns can help reduce the amount of data processed. Example of creating a clustered table:
```sql
CREATE OR REPLACE TABLE `your_dataset.your_table`
CLUSTER BY _airbyte_extracted_at, _airbyte_unique_key_scd
AS
SELECT * FROM `your_dataset.your_source_table`;
```
This approach is discussed in the [Github Issue on 1s1t Performance Testing](https://github.com/airbytehq/airbyte/issues/27571).

By implementing these optimizations, you can reduce the costs associated with loading and querying data in BigQuery. If you need further assistance, you can refer to the [BigQuery documentation on partitioned tables](https://cloud.google.com/bigquery/docs/partitioned-tables) and [clustering](https://cloud.google.com/bigquery/docs/clustered-tables) for more details.
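Whichever of these you adopt, you can measure the effect before paying for a query: a dry run reports the bytes a query would process without executing it. A minimal sketch using the `bq` CLI (the table name is a placeholder):

```sh
# Dry run: prints the estimated bytes processed without running the query or incurring cost.
bq query --use_legacy_sql=false --dry_run \
  'SELECT MAX(_airbyte_extracted_at) FROM `your_dataset.your_table`'
```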
l
@kapa.ai how do I customize this query? It's auto-generated when Airbyte tries to sync my table.
k
To customize the auto-generated SQL query used by Airbyte during the sync process, you can follow these steps:

1. Export the SQL Files: First, you need to export the SQL files from the Docker image. This will allow you to access and modify the SQL queries.
2. Edit the SQL Query: Once you have the SQL files, you can edit the query as needed. For example, you can optimize the query by assigning `max(_airbyte_emitted_at)` to a variable to avoid a full table scan, as mentioned in the forum post.
3. Replace the SQL Files: After editing the SQL query, you need to put the modified file back into the Docker image.

Here is a step-by-step guide based on the information provided:

### Step 1: Export the SQL Files

You can use Docker commands to copy the SQL files from the Docker container to your local machine. For example:
```sh
docker cp <container_id>:/path/to/sql/file.sql /local/path/to/sql/file.sql
```
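If you don't know the container id, `docker ps` can list the running containers. A minimal sketch, assuming the relevant container has "normalization" in its name (that filter string is an assumption about how your deployment names containers, not something Airbyte guarantees):

```sh
# List running containers whose name mentions "normalization",
# printing id and name so you can pick the right <container_id>.
docker ps --filter "name=normalization" --format "{{.ID}}  {{.Names}}"
```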
### Step 2: Edit the SQL Query

Open the exported SQL file in your preferred text editor and make the necessary changes. For example, you can modify the part of the query that selects `max(_airbyte_emitted_at)` to use a variable instead:
```sql
DECLARE max_emitted_at TIMESTAMP;
SET max_emitted_at = (SELECT MAX(CAST(_airbyte_emitted_at AS TIMESTAMP)) FROM `marketplace-reporting`.airbyte.`lazada_raw_orders_scd`);

CREATE OR REPLACE TABLE `marketplace-reporting`.airbyte.`lazada_raw_orders_scd__dbt_tmp`
PARTITION BY RANGE_BUCKET(
    _airbyte_active_row,
    GENERATE_ARRAY(0, 1, 1)
)
CLUSTER BY _airbyte_unique_key_scd, _airbyte_emitted_at
OPTIONS(
    expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR)
)
AS (
    WITH new_data AS (
        SELECT *
        FROM `marketplace-reporting`._airbyte_airbyte.`lazada_raw_orders_stg`
        WHERE 1 = 1
        AND COALESCE(
            CAST(_airbyte_emitted_at AS TIMESTAMP) >= max_emitted_at,
            TRUE
        )
    )
    -- Additional query logic here
);
```
### Step 3: Replace the SQL Files

After editing the SQL file, copy it back to the Docker container:

```sh
docker cp /local/path/to/sql/file.sql <container_id>:/path/to/sql/file.sql
```
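Before triggering the next sync, it can be worth confirming that the edited file actually landed in the container. A minimal sketch, assuming the image ships the usual coreutils (`head`):

```sh
# Print the first lines of the file inside the container to confirm the edit is in place.
docker exec <container_id> head /path/to/sql/file.sql
```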
### Additional Resources

For more detailed instructions on customizing SQL queries in Airbyte, you can refer to the [Airbyte documentation on transformations with SQL](https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-sql#customize-sql-query). If you encounter any issues or need further assistance, you can also check the [Airbyte community forum](https://discuss.airbyte.io) for additional support and examples.

By following these steps, you should be able to customize the auto-generated SQL query to better suit your needs and optimize performance.