Alexander Gregory
06/03/2024, 5:23 AM

kapa.ai
06/03/2024, 5:23 AM
… resetting the streams is recommended for the `Full Refresh | Append` sync mode to avoid inconsistencies, but it may lead to data loss.
5. Avoiding Data Loss:
- If you must reset the streams, ensure that your destination is configured to handle the data appropriately. For example, using append-only tables or maintaining historical data separately can help mitigate data loss.
By following these steps and being cautious about schema changes and data resets, you can update your source connector while minimizing the risk of deleting historical data. For more detailed guidance, you can refer to the [Airbyte documentation on resetting data](https://docs.airbyte.com/operator-guides/reset).

Alexander Gregory
06/03/2024, 5:24 AM

kapa.ai
06/03/2024, 5:24 AM

kapa.ai
06/03/2024, 5:24 AM

Alexander Gregory
06/03/2024, 5:27 AM

kapa.ai
06/03/2024, 5:27 AM

kapa.ai
06/03/2024, 5:28 AM
1. Check Resource Usage:
   - Check the resource usage of the Airbyte containers with `docker stats` if you are running Airbyte in Docker. For example:
   ```shell
   docker stats
   ```
   - Review the memory and CPU usage of the Airbyte containers to see if they are being throttled or running out of resources ([source](https://discuss.airbyte.io/t/2536)).
2. Review Normalization Logs:
   - Look at the normalization logs to identify any specific steps that are taking longer than expected. This can help pinpoint where the delay is occurring.
   - For example, if you see long gaps between logs like "Concurrency: 8 threads" and the start of model execution, it may indicate an issue with dbt parsing or model execution ([source](https://github.com/airbytehq/airbyte/issues/18317)).
3. Investigate Schema Changes:
   - If there have been any changes to the schema of the source or destination, it could impact the normalization process. Ensure that the schema changes are compatible and do not introduce inefficiencies.
4. Check for Memory Issues:
   - Sometimes, memory issues can cause slow performance. Adding more memory to the server pod has been known to resolve such issues ([source](https://discuss.airbyte.io/t/4686)).
5. Update Airbyte Version:
   - Ensure you are running the latest version of Airbyte. Updates often include performance improvements and bug fixes that could resolve your issue.
   - For example, upgrading to the latest version can help optimize queries executed to the database, reducing the memory footprint ([source](https://discuss.airbyte.io/t/2536)).
6. Review Data Volume:
   - Check if the volume of data being synced has increased. Even small increases in data volume can significantly impact performance if the system is not optimized for it.
7. Consider Partial Parsing in dbt:
   - If you are using dbt for normalization, consider enabling partial parsing to improve performance. Full parsing can be time-consuming, and partial parsing can help reduce the time taken ([source](https://github.com/airbytehq/airbyte/issues/18317)).

By following these steps, you should be able to identify and address the cause of the increased sync time.
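The log-gap check in step 2 can be sketched with a small script. The timestamp format below is an assumption for illustration; adjust the regex to match the prefix your normalization logs actually use:

```python
import re
from datetime import datetime

def find_gaps(log_lines, threshold_s=60):
    """Flag adjacent log lines whose timestamps are more than threshold_s
    seconds apart -- long gaps often point at slow dbt parsing or model runs."""
    ts_pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
    stamps = []
    for line in log_lines:
        m = ts_pattern.match(line)
        if m:
            stamps.append((datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"), line))
    gaps = []
    for (prev_ts, prev_line), (cur_ts, cur_line) in zip(stamps, stamps[1:]):
        delta = (cur_ts - prev_ts).total_seconds()
        if delta > threshold_s:
            gaps.append((delta, prev_line, cur_line))
    return gaps

# Made-up sample log lines showing the pattern described above
logs = [
    "2024-06-03 05:20:01 Concurrency: 8 threads",
    "2024-06-03 05:27:30 1 of 12 START table model ...",
    "2024-06-03 05:27:31 1 of 12 OK ...",
]
for delta, before, after in find_gaps(logs):
    print(f"{delta:.0f}s gap between: {before!r} and {after!r}")
```

Feeding a downloaded sync log through this highlights exactly where the time is going, which you can then match against the steps above.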
If the issue persists, you may want to reach out to the Airbyte community or support for further assistance.

Alexander Gregory
06/07/2024, 6:49 AM

kapa.ai
06/07/2024, 6:49 AM

kapa.ai
06/07/2024, 6:49 AM
… ensure that the cursor field (`updated_at`
) is correctly set to track changes.
2. Full Refresh | Append Mode:
- Full Refresh | Append mode appends new data to the existing data in the destination without deleting the old data.
- This mode is designed to preserve historical data by continuously adding new records.
### Steps to Safeguard Data During Upgrades/Resets
1. Backup Data:
- Always backup your destination data before performing any upgrades or resets. This ensures you have a copy of your historical data in case of any issues.
2. Review Schema Changes:
- When refreshing the source schema, carefully review any detected schema changes. Ensure that these changes do not disrupt the existing data structure.
- Follow the steps to refresh the schema without resetting the data:
1. Go to Connections in the main navbar.
2. Select the connection(s) affected by the update.
3. Go to the Replication tab and select Refresh source schema.
4. Review any detected schema changes and select OK.
5. Save the changes at the bottom of the page.
3. Avoid Full Resets:
- Avoid performing a full reset of the streams unless absolutely necessary. A full reset can lead to data being reloaded and potentially duplicated, but it should not delete historical data in append mode.
4. Monitor Syncs:
- After updating the connector, monitor the syncs to ensure that data is being appended correctly and no historical data is being deleted.
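As a quick sanity check for step 4, you can compare destination row counts before and after each sync: in append mode the count should never decrease. A minimal sketch using an in-memory SQLite table as a stand-in for the real destination (the table and column names are made up):

```python
import sqlite3

def row_count(conn, table):
    """Count rows in a destination table; in Full Refresh | Append mode a
    sync should only ever add rows, so a shrinking count is a red flag."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

before = row_count(conn, "events")
# ... a sync runs here; in append mode it only inserts new rows ...
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
after = row_count(conn, "events")

assert after >= before, "row count decreased -- historical data may have been deleted"
print(f"rows before={before}, after={after}")
```

Running the same count against your actual destination before and after a sync gives you an early warning if historical data ever disappears.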
### Additional Considerations
- Testing Destinations V2 for a Single Connection:
  - If you are testing new features or updates, you can create a separate connection with a `Manual` replication frequency to verify the format of updated data without affecting the existing data ([source](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2#testing-destinations-v2-for-a-single-connection)).
- Handling Data Resets:
  - If you need to reset streams, consider the impact carefully. For the `Full Refresh | Append` sync mode, resetting streams is highly recommended to avoid inconsistencies, but it may lead to data loss if not handled properly ([source](https://discuss.airbyte.io/t/3955)).
By following these guidelines, you can upgrade or reset streams while minimizing the risk of data loss. Always ensure you have a backup and carefully review any changes before applying them. For more detailed guidance, you can refer to the [Airbyte documentation on sync modes](https://docs.airbyte.com/understanding-airbyte/connections/full-refresh-append) and [resetting data](https://docs.airbyte.com/operator-guides/reset).

Alexander Gregory
06/07/2024, 7:00 AM

kapa.ai
06/07/2024, 7:00 AM

kapa.ai
06/07/2024, 7:00 AM
Without normalization, the raw tables contain `_airbyte` columns and a JSON blob containing all the data. You would indeed have to unnest a large JSON object yourself.

### Table Structure Without Normalization
When normalization is not enabled, each stream will be output into its own raw table in Redshift. Each table will contain the following columns:
- `_airbyte_raw_id`: a UUID assigned by Airbyte to each event that is processed. The column type in Redshift is `VARCHAR`.
- `_airbyte_extracted_at`: a timestamp representing when the event was pulled from the data source. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
- `_airbyte_loaded_at`: a timestamp representing when the row was processed into the final table. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
- `_airbyte_data`: a JSON blob representing the event data. The column type in Redshift is `SUPER`.

Here is an example of what the table structure would look like:

```sql
CREATE TABLE your_stream_name (
    _airbyte_raw_id        VARCHAR,
    _airbyte_extracted_at  TIMESTAMP WITH TIME ZONE,
    _airbyte_loaded_at     TIMESTAMP WITH TIME ZONE,
    _airbyte_data          SUPER
);
```
### Unnesting the JSON Object
Since the data is stored as a JSON blob in the `_airbyte_data` column, you will need to unnest this JSON object to access the individual fields. This can be done using SQL queries that parse the JSON data. For example, in Redshift, you can use the `json_extract_path_text` function to extract specific fields from the JSON blob:

```sql
SELECT json_extract_path_text(_airbyte_data, 'field_name') AS field_name
FROM your_stream_name;
```
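If you would rather unnest outside the warehouse, the same extraction can be done client-side after fetching the raw rows. A minimal Python sketch, with made-up row data standing in for a query result:

```python
import json

def unnest_rows(raw_rows, fields):
    """Parse the JSON blob in each row's _airbyte_data column and pull out
    the requested fields, mirroring what json_extract_path_text does in SQL."""
    out = []
    for row in raw_rows:
        data = json.loads(row["_airbyte_data"])
        out.append({f: data.get(f) for f in fields})
    return out

# Illustrative rows shaped like the raw table described above
raw_rows = [
    {"_airbyte_raw_id": "a1", "_airbyte_data": '{"id": 1, "name": "Ada", "city": "London"}'},
    {"_airbyte_raw_id": "a2", "_airbyte_data": '{"id": 2, "name": "Grace"}'},
]
print(unnest_rows(raw_rows, ["id", "name"]))
# [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Grace'}]
```

Note that `data.get(f)` returns `None` for fields missing from a given event, which is the behavior you usually want with loosely-schemaed source data.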
### Enabling Normalization
If you want Airbyte to handle the unnesting and structuring of the data for you, you can enable Basic Normalization. When enabled, Airbyte will transform the raw JSON blob into a structured format appropriate for your destination, creating tables with columns corresponding to the fields in your data. For more information on how normalization works and how to enable it, you can refer to the [Airbyte documentation on Basic Normalization](https://docs.airbyte.com/using-airbyte/core-concepts/basic-normalization#high-level-overview). By understanding these details, you can decide whether to handle the JSON unnesting yourself or enable normalization to automate the process.

Alexander Gregory
06/07/2024, 7:02 AM

kapa.ai
06/07/2024, 7:02 AM

kapa.ai
06/07/2024, 7:02 AM
For example, given a record like:

```json
{
  "id": "123",
  "name": "John Doe",
  "address": {
    "city": "New York",
    "zip": "10001"
  }
}
```
You can modify the dbt model to only include the `id` and `name` fields, and keep the `address` field as a JSON object:

```sql
-- dbt's source() takes a source name and a table name; names here are illustrative
SELECT
    id,
    name,
    address
FROM {{ source('airbyte_raw', 'your_stream_name') }}
```
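The idea of stopping normalization at a chosen depth can be illustrated with a toy flattener. This is not Airbyte's actual normalization code, just a sketch of the behavior:

```python
import json

def flatten(record, max_depth=1, _depth=0, _prefix=""):
    """Flatten nested dicts only down to max_depth; anything deeper is kept
    as a JSON string, like keeping `address` as a single JSON column."""
    out = {}
    for key, value in record.items():
        name = f"{_prefix}{key}"
        if isinstance(value, dict) and _depth < max_depth - 1:
            out.update(flatten(value, max_depth, _depth + 1, f"{name}_"))
        elif isinstance(value, dict):
            out[name] = json.dumps(value)  # left as a JSON blob
        else:
            out[name] = value
    return out

record = {"id": "123", "name": "John Doe",
          "address": {"city": "New York", "zip": "10001"}}
print(flatten(record, max_depth=1))
# {'id': '123', 'name': 'John Doe', 'address': '{"city": "New York", "zip": "10001"}'}
```

With `max_depth=1` you get a single table with `address` as one JSON column; with `max_depth=2` the same record would instead yield `address_city` and `address_zip` columns, which is the kind of expansion that multiplies table and column counts.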
By customizing the dbt project, you can control the depth of normalization and reduce the number of tables created.

### Future Updates
Keep an eye on the Airbyte GitHub repository and discussions for updates on the normalization process. The planned changes will make it easier to control the depth of normalization and reduce the number of tables created. For more information, you can refer to the relevant GitHub discussion: [Controlling depth at which Basic Normalization expands nested JSON](https://github.com/airbytehq/airbyte/discussions/28407).

By following these steps, you can manage the depth of normalization and ensure that it creates a single table or fewer tables as needed.