Srini — 02/10/2025, 5:09 AM

kapa.ai — 02/10/2025, 5:09 AM

Srini — 02/10/2025, 5:11 AM

kapa.ai — 02/10/2025, 5:11 AM

```yaml
worker:
  replicaCount: ## e.g. 2
```
[Documentation > Enterprise-setup > Scaling Airbyte After Installation](https://docs.airbyte.com/enterprise-setup/scaling-airbyte)

3. Concurrent Sync Limits: Airbyte allows you to configure the maximum number of concurrent jobs that can run across the system. This is done through environment variables:

```yaml
worker:
  extraEnvs:
    - name: MAX_SYNC_WORKERS
      value: ## e.g. 5
    - name: MAX_CHECK_WORKERS
      value: ## e.g. 5
```
[Documentation > Enterprise-setup > Scaling Airbyte After Installation](https://docs.airbyte.com/enterprise-setup/scaling-airbyte)

4. Load Balancing: For Airbyte Cloud and Enterprise users, there is a load-balancing strategy that lets different data-plane instances compete for workloads, enabling a horizontally scaling, self-healing setup [Workloads - Airbyte's Next Generation Job Framework](https://github.com/airbytehq/airbyte/discussions/42947).

5. Job Distribution: The number of concurrent syncs that can run is calculated as:

`parallel syncs = number of worker pods * min(MAX_SYNC_WORKERS, TEMPORAL_WORKER_PORTS / 4)`

[Forum Post: How to configure to have a larger number of concurrency? > Answers Part 5](https://discuss.airbyte.io/t/1077)

Note that the exact distribution mechanism may vary depending on your Airbyte version and deployment configuration. The Workloads framework, Airbyte's next-generation job architecture, aims to improve scalability and resilience by decoupling the number of running jobs from the number of jobs that can be started [Workloads - Airbyte's Next Generation Job Framework](https://github.com/airbytehq/airbyte/discussions/42947). For the most up-to-date and detailed information on load distribution in your specific Airbyte setup, consult the documentation for your current version or reach out to Airbyte support.
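As a quick sanity check, the concurrency formula in point 5 can be evaluated directly. The function below and the example numbers in it are illustrative, not defaults taken from any particular Airbyte release:

```python
def max_parallel_syncs(worker_pods: int, max_sync_workers: int, temporal_worker_ports: int) -> int:
    """Estimate sync concurrency per the formula above:

    parallel syncs = worker pods * min(MAX_SYNC_WORKERS, TEMPORAL_WORKER_PORTS / 4)
    """
    return worker_pods * min(max_sync_workers, temporal_worker_ports // 4)

# With 2 worker pods, MAX_SYNC_WORKERS=5 and 30 Temporal worker ports (illustrative):
print(max_parallel_syncs(2, 5, 30))  # 2 * min(5, 7) = 10
```

In other words, raising `MAX_SYNC_WORKERS` alone stops helping once `TEMPORAL_WORKER_PORTS / 4` becomes the smaller term; the two settings have to scale together.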
Srini — 02/10/2025, 5:14 AM

kapa.ai — 02/10/2025, 5:14 AM

`MAX_SYNC_WORKERS`
setting and the number of worker replicas [Documentation > Enterprise-setup > Scaling Airbyte After Installation](https://docs.airbyte.com/enterprise-setup/scaling-airbyte).

3. Airbyte's architecture allows horizontal scaling of sync workloads on Kubernetes, but the specifics of handling petabyte-scale data are not detailed [Forum Post: How does Airbyte handle load massive volume of data?](https://discuss.airbyte.io/t/813).

4. For large data volumes, memory can be a significant constraint, especially for database sources with large row sizes [Scaling Airbyte | Airbyte Documentation](https://docs.airbyte.com/operator-guides/scaling-airbyte#memory).

Given the lack of specific information about handling petabyte-scale syncs, it would be advisable to consult Airbyte's support team directly or engage with the community for guidance on managing such large-scale operations. They may be able to provide custom solutions or best practices for distributing the load of extremely large sync jobs across worker pods.
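On the memory point, one lever worth knowing: job-container resource requests and limits can be set through environment variables such as `JOB_MAIN_CONTAINER_MEMORY_REQUEST` and `JOB_MAIN_CONTAINER_MEMORY_LIMIT`, which are covered in the scaling documentation linked above. A sketch with illustrative values (verify where these env vars belong in your chart version):

```yaml
worker:
  extraEnvs:
    - name: JOB_MAIN_CONTAINER_MEMORY_REQUEST
      value: "2Gi" ## illustrative
    - name: JOB_MAIN_CONTAINER_MEMORY_LIMIT
      value: "4Gi" ## illustrative
```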
Srini — 02/10/2025, 5:26 AM

kapa.ai — 02/10/2025, 5:26 AM

```yaml
worker:
  replicaCount: ## increase this number, e.g. 5
```
[Documentation > Enterprise-setup > Scaling Airbyte After Installation](https://docs.airbyte.com/enterprise-setup/scaling-airbyte)

2. Adjust MAX_SYNC_WORKERS: Increase the `MAX_SYNC_WORKERS` environment variable to allow more concurrent sync jobs:

```yaml
worker:
  extraEnvs:
    - name: MAX_SYNC_WORKERS
      value: ## increase this number, e.g. 10
```
[Documentation > Enterprise-setup > Scaling Airbyte After Installation](https://docs.airbyte.com/enterprise-setup/scaling-airbyte)

3. Parallelize at the connection level: For sources with multiple streams, you can create separate connections for each stream so they sync in parallel. This approach was suggested in a forum post about improving sync performance [Forum Post: Source MSSQL - initial load is very slow (CDC run)](https://discuss.airbyte.io/t/1686).

4. Use the Concurrent CDK: Some sources (e.g., Stripe) already support parallel processing using Airbyte's concurrent CDK, which allows parallelization within a single connection [Github Issue (open): [MOVE EPIC] Parallel Replication](https://github.com/airbytehq/airbyte/issues/7750).

5. Consider breaking long syncs: Although not currently implemented, there is a proposal to break long, large data syncs into multiple smaller syncs, which could help with resumability and reduce the impact of transient failures [Github Issue (open): Break long and large data syncs into multiple small syncs](https://github.com/airbytehq/airbyte/issues/13783).

6. Optimize resource allocation: Ensure that your Kubernetes cluster has sufficient resources to handle the increased parallelism. You may need to adjust CPU and memory allocations for job containers [Scaling Airbyte | Airbyte Documentation](https://docs.airbyte.com/operator-guides/scaling-airbyte#memory).

Remember that the effectiveness of these strategies varies with your specific data sources, volume, and infrastructure. Monitor your system's performance and adjust accordingly. For very large jobs (e.g., petabyte-scale), you may need to consult Airbyte support for more tailored solutions.
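The per-stream fan-out in point 3 is mostly a bookkeeping exercise: partition the source's streams into groups and create one connection per group. A minimal sketch of the partitioning step (the stream names and round-robin grouping are illustrative; actually creating the connections would go through the Airbyte UI or API):

```python
def partition_streams(streams: list[str], num_connections: int) -> list[list[str]]:
    """Round-robin streams into groups, one group per planned connection."""
    groups: list[list[str]] = [[] for _ in range(num_connections)]
    for i, stream in enumerate(streams):
        groups[i % num_connections].append(stream)
    return groups

# e.g. split 5 hypothetical streams across 2 connections
print(partition_streams(["users", "orders", "items", "events", "logs"], 2))
# [['users', 'items', 'logs'], ['orders', 'events']]
```

One caveat: each connection carries its own scheduling and state, so fan-out multiplies operational overhead; it trades management complexity for parallelism.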