# ask-ai
@kapa.ai If I have a source and some streams are full-refresh append, how can I update the source connector without it deleting the data? It always seems to recommend stream resets/refreshes and it isn't clear whether this will truncate the tables and hence delete historical data I need.
To update a source connector without deleting historical data, especially when using full-refresh append streams, you need to be cautious about the steps you take during the update process. Here are some key points and steps to consider:

1. **Understand the sync modes:**
   - **Full Refresh:** copies all rows in the tables and columns you set up for replication every time a sync runs, which can overwrite existing data depending on the destination mode.
   - **Incremental:** copies only new or updated data, which is safer for preserving historical data.
2. **Schema changes and data resets:** when updating a connector, especially if there are schema changes, Airbyte often recommends refreshing the source schema and resetting the data. A reset can overwrite or delete data if not handled properly.
3. **Steps to update without data loss:**
   - **Review schema changes** before proceeding with any updates, so you understand their impact.
   - **Back up your data** before making any changes (see the sketch after this answer).
   - **Use incremental syncs** where possible to avoid full data resets; incremental syncs are recommended when feasible ([source](https://docs.airbyte.com/connector-development/cdk-python/full-refresh-stream#incremental-streams)).
   - **Consider resumable full refresh streams**, which offer improved resilience against errors by checkpointing the stream's page number or cursor ([source](https://docs.airbyte.com/connector-development/cdk-python/full-refresh-stream#resumable-full-refresh-streams)).
4. **Updating the connector:** follow the steps to refresh schemas and reset data as outlined in the [Microsoft Teams Migration Guide](https://docs.airbyte.com/integrations/sources/microsoft-teams-migrations#upgrading-to-100), but be cautious with the reset step:
   1. Select **Connections** in the main navbar.
   2. Select the connection(s) affected by the update.
   3. Select the **Replication** tab, then select **Refresh source schema**.
   4. Review any detected schema changes and select **OK**.
   5. At the bottom of the page, select **Save changes**.
   6. If prompted to **Reset all streams**, consider the impact carefully. Resetting is recommended for the Full Refresh | Append sync mode to avoid inconsistencies, but it can delete the history you have accumulated.
5. **Avoiding data loss:** if you must reset the streams, ensure that your destination is configured to handle the data appropriately, for example by using append-only tables or keeping historical data in a separate location.

By following these steps and being cautious about schema changes and data resets, you can update your source connector while minimizing the risk of deleting historical data. For more detailed guidance, see the [Airbyte documentation on resetting data](https://docs.airbyte.com/operator-guides/reset).
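A simple way to protect that history before any reset is to snapshot the destination table first. Below is a minimal sketch for a SQL destination; the schema and table names are hypothetical placeholders, not anything Airbyte creates for you:

```sql
-- Hypothetical names: replace "airbyte_schema" and "my_stream" with
-- your actual destination schema and stream table.
CREATE TABLE airbyte_schema.my_stream_backup AS
SELECT * FROM airbyte_schema.my_stream;

-- After the reset and the first new sync, the old snapshots can be
-- merged back in if you still need them:
-- INSERT INTO airbyte_schema.my_stream
-- SELECT * FROM airbyte_schema.my_stream_backup;
```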
@kapa.ai That doesn't address what I just asked. I have it as APPEND mode!
To update a source connector without deleting historical data when using Full Refresh | Append mode, you need to ensure that data is appended rather than overwritten. Here are the steps and considerations:

1. **Understand Full Refresh | Append mode:** each sync appends the new data to the existing data in the destination without deleting the old data, so every sync adds another full snapshot and history is preserved by design.
2. **Avoid data resets:** when updating the source connector, avoid resetting the streams unless absolutely necessary. A reset clears the previously synced data for the affected streams, which in append mode means losing the accumulated snapshots.
3. **Steps to update the connector:** if there are schema changes, you can refresh the source schema without resetting the data:
   1. Go to **Connections** in the main navbar.
   2. Select the connection(s) affected by the update.
   3. Go to the **Replication** tab and select **Refresh source schema**.
   4. Review any detected schema changes and select **OK**.
   5. Save the changes at the bottom of the page.
4. **Handle schema changes carefully:** review any detected changes to make sure they do not disrupt the existing data structure and are compatible with append mode.
5. **Back up data:** before making any updates, take a backup of your historical data as a precaution against unexpected issues.
6. **Monitor syncs:** after updating the connector, check that data is being appended correctly and that no historical data has been removed.

By following these steps, you can update your source connector while ensuring that historical data is preserved in Full Refresh | Append mode. For more detailed guidance, see the [Airbyte documentation on sync modes](https://docs.airbyte.com/understanding-airbyte/connections/full-refresh-append).
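Because each Full Refresh | Append sync adds a complete snapshot, the destination table accumulates one copy of the source per sync. If you only want the latest snapshot for analysis, you can filter on the extraction timestamp. A sketch, assuming the standard `_airbyte_extracted_at` column; the table name is a placeholder:

```sql
-- Each sync appends a full snapshot; keep only the newest one.
SELECT *
FROM my_stream
WHERE _airbyte_extracted_at = (
    SELECT MAX(_airbyte_extracted_at) FROM my_stream
);
```

If extraction timestamps vary slightly within a single sync, filter on a small time window around the maximum instead of an exact match.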
@kapa.ai I have a connection that used to take ~5 minutes to run and I haven't updated it. It now seems to take an hour and I have no idea why? The logs don't seem to imply much, other than it seems normalisation is taking a long time
If your connection has suddenly started taking significantly longer to run, and the logs indicate that normalization is the slow step, there are several potential causes and ways to diagnose the issue:

1. **Check resource allocation:** ensure your Airbyte instance has sufficient memory and CPU; starved containers run slowly. If you run Airbyte in Docker, you can watch container usage with:

```shell
docker stats
```

Review the memory and CPU usage of the Airbyte containers to see if they are being throttled or running out of resources ([source](https://discuss.airbyte.io/t/2536)).

2. **Review the normalization logs:** look for specific steps that take longer than expected. For example, a long gap between logs like "Concurrency: 8 threads" and the start of model execution can indicate a problem with dbt parsing or model execution ([source](https://github.com/airbytehq/airbyte/issues/18317)).
3. **Investigate schema changes:** changes to the source or destination schema can affect the normalization step. Make sure any changes are compatible and do not introduce inefficiencies.
4. **Check for memory issues:** memory pressure can slow things down; adding more memory to the server pod has been known to resolve such issues ([source](https://discuss.airbyte.io/t/4686)).
5. **Update your Airbyte version:** updates often include performance improvements and bug fixes, such as optimized database queries with a smaller memory footprint ([source](https://discuss.airbyte.io/t/2536)).
6. **Review data volume:** check whether the volume of data being synced has grown; even modest growth can significantly increase sync time if the system is not tuned for it (see the sketch after this list).
7. **Consider partial parsing in dbt:** if you are using dbt for normalization, enabling partial parsing can reduce the time spent re-parsing the project on each run ([source](https://github.com/airbytehq/airbyte/issues/18317)).

By working through these steps, you should be able to identify what is driving the increased sync time. If the issue persists, reach out to the Airbyte community or support for further assistance.
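One quick check for the data-volume point is to see how much each recent sync wrote. A sketch against a SQL destination, assuming the standard `_airbyte_extracted_at` column; the table name is a placeholder, and grouping by day sidesteps the fact that extraction timestamps can vary within one sync:

```sql
-- Rows extracted per day: steadily growing counts suggest data volume,
-- not the connector itself, is driving the longer runtimes.
SELECT
    DATE_TRUNC('day', _airbyte_extracted_at) AS sync_day,
    COUNT(*)                                 AS rows_synced
FROM my_stream
GROUP BY 1
ORDER BY sync_day DESC;
```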
@kapa.ai How do I know if my source/destination doesn't delete my data when I need to upgrade/reset streams? What if I am running something as an incremental or full-refresh append. Truncating the table Airbyte creates would be really bad
To ensure that your source/destination does not delete your data when you need to upgrade or reset streams, especially when using Incremental or Full Refresh | Append modes, follow these guidelines:

### Understanding Sync Modes

1. **Incremental:** copies only records that are new or updated since the last sync, and does not delete existing data in the destination. Make sure the cursor column (e.g. `updated_at`) is set correctly to track changes.
2. **Full Refresh | Append:** appends each sync's data to the existing data without deleting old records, preserving history by continuously adding new snapshots.

### Steps to Safeguard Data During Upgrades/Resets

1. **Back up data:** always back up your destination data before performing any upgrades or resets, so you have a copy of the history if anything goes wrong.
2. **Review schema changes:** when refreshing the source schema, carefully review any detected changes and confirm they do not disrupt the existing data structure. To refresh the schema without resetting the data:
   1. Go to **Connections** in the main navbar.
   2. Select the connection(s) affected by the update.
   3. Go to the **Replication** tab and select **Refresh source schema**.
   4. Review any detected schema changes and select **OK**.
   5. Save the changes at the bottom of the page.
3. **Avoid full resets:** do not reset streams unless absolutely necessary. A reset removes the previously synced data for those streams before reloading, which is exactly the truncation you want to avoid with append mode.
4. **Monitor syncs:** after updating the connector, verify that data is being appended correctly and no historical data has disappeared (see the sketch after this answer).

### Additional Considerations

- **Testing Destinations V2 for a single connection:** you can create a separate connection with a `Manual` replication frequency to verify the format of updated data without affecting existing data ([source](https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2#testing-destinations-v2-for-a-single-connection)).
- **Handling data resets:** if you must reset streams, weigh the impact carefully. For the Full Refresh | Append sync mode, a reset is recommended after schema changes to avoid inconsistencies, but it will discard the appended history unless you back it up first ([source](https://discuss.airbyte.io/t/3955)).

By following these guidelines, you can upgrade or reset streams while minimizing the risk of data loss. Always keep a backup and review changes carefully before applying them. For more detail, see the [Airbyte documentation on sync modes](https://docs.airbyte.com/understanding-airbyte/connections/full-refresh-append) and on [resetting data](https://docs.airbyte.com/operator-guides/reset).
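For the monitoring step, comparing a few aggregates before and after the upgrade will confirm nothing was truncated. A minimal sketch, assuming a SQL destination with the standard Airbyte columns; the table name is hypothetical:

```sql
-- Run before and after the upgrade: total_rows and oldest_record
-- should never decrease if the appended history is intact.
SELECT
    COUNT(*)                   AS total_rows,
    MIN(_airbyte_extracted_at) AS oldest_record,
    MAX(_airbyte_extracted_at) AS newest_record
FROM my_stream;
```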
@kapa.ai Without normalisation on, would say the typeform to redshift create a table per stream with columns from the data, or would it just be the _airbyte columns? Basically, would I have to unnest a large JSON object?
Without normalization enabled, the data from Typeform to Redshift will be written to a table containing only the `_airbyte` columns and a JSON blob holding all the data, so you would indeed have to unnest a large JSON object yourself.

### Table Structure Without Normalization

When normalization is not enabled, each stream is output into its own raw table in Redshift. Each table contains the following columns:

- `_airbyte_raw_id`: a UUID assigned by Airbyte to each event that is processed. The column type in Redshift is `VARCHAR`.
- `_airbyte_extracted_at`: a timestamp representing when the event was pulled from the data source. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
- `_airbyte_loaded_at`: a timestamp representing when the row was processed into the final table. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
- `_airbyte_data`: a JSON blob representing the event data. The column type in Redshift is `SUPER`.

Here is an example of what the table structure would look like:

```sql
CREATE TABLE your_stream_name (
    _airbyte_raw_id       VARCHAR,
    _airbyte_extracted_at TIMESTAMP WITH TIME ZONE,
    _airbyte_loaded_at    TIMESTAMP WITH TIME ZONE,
    _airbyte_data         SUPER
);
```
### Unnesting the JSON Object

Since the data is stored as a JSON blob in the `_airbyte_data` column, you will need to unnest it yourself to access individual fields. For example, in Redshift you can use the `json_extract_path_text` function to extract specific fields from the blob:

```sql
SELECT
    json_extract_path_text(_airbyte_data, 'field_name') AS field_name
FROM your_stream_name;
```
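Note that `json_extract_path_text` expects the JSON as text, and `_airbyte_data` is typed `SUPER`, so you may need to serialize the value first, or, more simply, use Redshift's PartiQL dot notation to navigate the `SUPER` value directly. A sketch, with `field_name` standing in for one of your Typeform fields:

```sql
-- PartiQL navigation on a SUPER column: no JSON string parsing needed.
SELECT
    _airbyte_data.field_name AS field_name
FROM your_stream_name;
```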
### Enabling Normalization

If you want Airbyte to handle the unnesting and structuring of the data for you, you can enable Basic Normalization. When enabled, Airbyte transforms the raw JSON blob into a structured format appropriate for your destination, creating tables with columns corresponding to the fields in your data. For more information on how normalization works and how to enable it, see the [Airbyte documentation on Basic Normalization](https://docs.airbyte.com/using-airbyte/core-concepts/basic-normalization#high-level-overview).

With these details, you can decide whether to handle the JSON unnesting yourself or enable normalization to automate the process.
@kapa.ai Yes, but it seems if normalisation is on, like 60 tables are created? Is there a way to control the depth of this, i.e. 1 such that normalisation is actually just creating a single table
Yes, you are correct that enabling normalization can result in the creation of many tables, because Airbyte's Basic Normalization expands nested JSON objects into tables of their own.

### Controlling the Depth of Normalization

As of now, there isn't a built-in feature in Airbyte to control the depth at which Basic Normalization expands nested JSON objects. However, there are ongoing efforts to improve this.

#### Upcoming Changes to Normalization

According to a discussion on GitHub, Airbyte is planning significant updates to the normalization process, including stopping the unnesting of JSON completely. The first destination to receive this update will be BigQuery, with other destinations planned to follow ([source](https://github.com/airbytehq/airbyte/discussions/28407)).

#### Customizing Normalization with dbt

In the meantime, you can customize the dbt normalization project to control the depth of unnesting:

1. **Export the dbt normalization project** from Airbyte.
2. **Modify the dbt project:** remove or adjust the models that unnest nested JSON objects, keeping only the top-level tables.

There is a tutorial on working with dbt normalization in Airbyte here: [Normalization Tutorial](https://docs.airbyte.com/operator-guides/transformation-and-normalization/).

### Example of Customizing dbt

If you want to stop unnesting nested JSON objects, you can modify the dbt models to include only the top-level fields. For example, if your JSON data looks like this:

```json
{
  "id": "123",
  "name": "John Doe",
  "address": {
    "city": "New York",
    "zip": "10001"
  }
}
```

you can modify the dbt model to select only the `id` and `name` fields and keep `address` as a single JSON object:

```sql
-- 'airbyte_raw' and 'your_stream_name' are placeholder source and
-- table names; dbt's source() takes both a source name and a table name.
SELECT
    id,
    name,
    address
FROM {{ source('airbyte_raw', 'your_stream_name') }}
```

By customizing the dbt project this way, you can control the depth of normalization and ensure it creates a single table, or at least far fewer tables, as needed.

### Future Updates

Keep an eye on the Airbyte GitHub repository and discussions for updates on the normalization process; the planned changes will make it easier to control unnesting depth and reduce the number of tables created. The relevant discussion is here: [Controlling depth at which Basic Normalization expands nested JSON](https://github.com/airbytehq/airbyte/discussions/28407).
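If you would rather flatten a known nested object into the same single table, instead of letting normalization split it out, the model can project the nested keys inline. A sketch using the same hypothetical source names, with Redshift `SUPER` dot notation for the nested keys:

```sql
-- One table, nested keys promoted to columns, no child tables created.
-- 'airbyte_raw' / 'your_stream_name' remain placeholder names.
SELECT
    id,
    name,
    address.city AS address_city,
    address.zip  AS address_zip
FROM {{ source('airbyte_raw', 'your_stream_name') }}
```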