<@U04S8PLJWMT> i run a Incremental Sync - Append j...
# ask-ai
j
@kapa.ai i run a Incremental Sync - Append job to sync a mssql db to redshift using cdc, the job runs without errors, and i now the source table has 51million records, but only 43million records are loaded, how do you suggest i troubleshoot this?
k
When you observe a mismatch between the number of records in your MSSQL source (51 million) and the number loaded into Redshift (43 million) after an Incremental Sync - Append job using CDC, here are the recommended troubleshooting steps based on Airbyte documentation and community experience: 1. Check CDC Configuration and Limitations - Ensure CDC is enabled on all relevant tables and that each table has a primary key, as CDC incremental syncing is only available for tables with at least one primary key. Tables without primary keys are only replicated in Full Refresh mode, which could lead to missing records if not handled properly. - CDC only tracks changes made via
INSERT
,
UPDATE
, and
DELETE
. Changes from operations like
TRUNCATE
or
ALTER
are not captured. - Data must be in tables, not views, and CDC must be enabled on each table you want to sync. If new tables or columns were added, ensure CDC is enabled for them and that Airbyte is aware of the schema changes. For new columns, you may need to disable and re-enable CDC on the table to ensure the new column is tracked and included in the sync ([MSSQL CDC docs](https://docs.airbyte.com/integrations/sources/mssql#change-data-capture-cdc), [CDC limitations](https://docs.airbyte.com/platform/next/understanding-airbyte/cdc#limitations), [Adding columns to CDC](https://docs.airbyte.com/integrations/sources/mssql/mssql-troubleshooting#adding-columns-to-existing-tables-with-cdc)). 2. Review Retention Period of CDC Data - By default, SQL Server retains CDC data for only three days. If your initial or incremental sync takes longer, you may lose change data before it is synced, resulting in missing records. Increase the retention period using the
sp_cdc_change_job
stored procedure and restart the cleanup job to avoid data loss ([Extend CDC retention](https://docs.airbyte.com/integrations/sources/mssql#4-extend-the-retention-period-of-cdc-data)). 3. Check for Schema Changes - If columns were added to tables after CDC was enabled, those columns may not be tracked unless CDC is re-enabled for the table. This can result in missing data for those columns or even entire records if the schema is not aligned ([Adding columns to CDC](https://docs.airbyte.com/integrations/sources/mssql/mssql-troubleshooting#adding-columns-to-existing-tables-with-cdc)). 4. Investigate Sync Mode Limitations - The Incremental Sync - Append mode relies on a cursor column. If records are updated without updating the cursor field, those changes will not be picked up. Also, if multiple changes occur between syncs and the sync frequency is not granular enough, some changes may be missed ([Incremental sync limitations](https://docs.airbyte.com/platform/using-airbyte/core-concepts/sync-modes/incremental-append#known-limitations)). 5. Review Logs and Airbyte Job Output - Check the Airbyte job logs for any warnings or errors that might indicate records were skipped or filtered out. Look for any mention of records being filtered, connection issues, or schema mismatches. 6. Consider Initial Snapshot and CDC Overlap - During the initial sync, Airbyte takes a snapshot of the current state and then starts reading CDC logs. If there is a gap between the snapshot and CDC log reading, changes made during that window may be missed. This is a known challenge in CDC-based syncs and can lead to missing records if not handled carefully ([Forum discussion on missing rows](https://github.com/airbytehq/airbyte/issues/53614)). 7. Resource and Performance Issues - For very large tables, performance bottlenecks or timeouts can cause incomplete syncs. Ensure your Airbyte deployment has sufficient resources and consider increasing parallelism if using Kubernetes ([Forum post on slow initial load](https://discuss.airbyte.io/t/1686)). Summary of Actions: - Verify CDC is enabled and configured correctly on all relevant tables and columns. - Increase CDC retention period to cover the full duration of your sync. - Check for schema changes and re-enable CDC if needed. - Review Airbyte logs for errors or skipped records. - Ensure your sync frequency and resources are sufficient for your data volume. If after these checks the issue persists, consider running a Full Refresh sync to reconcile the record counts, and open a support ticket with Airbyte including your logs and configuration details for further assistance.
j
@kapa.ai, i meant full refresh sync
k
A message from kapa.ai
If you are running a Full Refresh - Append sync from MSSQL to Redshift and see that only 43 million out of 51 million records are loaded, here are the recommended troubleshooting steps based on Airbyte documentation: 1. Understand Full Refresh - Append Behavior In Full Refresh - Append mode, Airbyte reads all available data from the source and appends it to the destination table. Each sync will add all records from the source, potentially creating duplicates if run multiple times. It does not remove or overwrite existing data in the destination, but it should always load all records from the source during each sync. If you see fewer records in the destination, something is likely interrupting the process or filtering records out unexpectedly [Full Refresh - Append](https://docs.airbyte.com/platform/using-airbyte/core-concepts/sync-modes/full-refresh-append). 2. Check for Sync Errors or Warnings Even if the job reports as successful, review the Airbyte job logs for any warnings or errors that might indicate records were skipped, filtered, or failed to load. Sometimes, issues like timeouts, memory limits, or destination-side constraints (such as primary key violations) can silently cause data loss. 3. Destination Table Constraints If your Redshift destination table has constraints (e.g., primary keys, unique indexes), duplicate or conflicting records from the source may be rejected during the insert process. This could result in fewer records being loaded than expected. 4. Data Type or Schema Mismatches If there are schema mismatches or data type incompatibilities between MSSQL and Redshift, some records may fail to load. Check the Airbyte logs for any schema-related errors. 5. Batch or Network Issues For very large tables, network interruptions or resource limitations (memory, disk, etc.) can cause incomplete data loads. Ensure your Airbyte deployment and Redshift cluster have sufficient resources for large data transfers. 6. Verify Source Data Double-check that the source table in MSSQL actually contains 51 million records at the time the sync starts. If records are being added or removed during the sync, the count may differ. 7. Review Normalization and Transformation Steps If you are using Airbyte's normalization, check the generated SQL and logs to ensure that no records are being filtered out during this process. The normalization step can sometimes exclude records based on certain criteria [Normalization SQL example](https://airbyte.com/tutorials/full-data-synchronization#step-8). If after these checks you still cannot identify the cause, try running the sync again and closely monitor the logs for any anomalies. If possible, perform a test sync to a new, empty table to rule out issues with existing data or constraints in the destination. If the problem persists, consider reaching out to Airbyte support with your logs and configuration details for further assistance. > In summary: - Review Airbyte logs for errors or warnings - Check destination table constraints - Ensure schema compatibility - Verify source record count - Monitor resource usage during sync - Review normalization steps For more details on Full Refresh - Append mode and troubleshooting, see the official documentation: [Full Refresh - Append](https://docs.airbyte.com/platform/using-airbyte/core-concepts/sync-modes/full-refresh-append).