# ask-ai
Issue Title: Airbyte Worker Failing to Run Replication-Orchestrator on EKS
Is this your first time deploying Airbyte?: No
OS Version / Instance: Ubuntu
Memory / Disk: 8GB / 50GB
Deployment: Kubernetes (EKS, AWS)
Airbyte Version: 0.59.1
Source name/version: All
Destination name/version: Redshift

Steps to Reproduce:
1. Deploy Airbyte on EKS following the official documentation.
2. Create a connection with any source and Redshift as the destination.
3. Run the connection.
4. Observe the failure in the Airbyte worker logs.

Description:
We have been experiencing consistent failures with the Airbyte worker while attempting to run the `replication-orchestrator` on our EKS deployment. The issue occurs across all connections and sources. Below is the relevant log excerpt:
```text
Running the launcher replication-orchestrator failed

message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false

Received failed response: 404, Request ID: XXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX==
Received failed response: 404, Request ID: XXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX==
Creating orchestrator-repl-job-2568-attempt-1 for attempt number: 1
There are currently running pods for the connection: [orchestrator-repl-job-2568-attempt-0]. Killing these pods to enforce one execution at a time.
Attempting to delete pods: [orchestrator-repl-job-2568-attempt-0]
Waiting for deletion...
Successfully deleted all running pods for the connection!
Waiting for pod to be running...
Ready to run resync and reflector for v1/namespaces/default/pods with resync 0
Resync skipped due to 0 full resync period for v1/namespaces/default/pods
```

The error seems to occur during the replication orchestration process. The worker attempts to run the job but consistently fails, triggering a retry mechanism with a backoff period. We have checked the following potential causes without success:

• Network connectivity issues
• Resource constraints (CPU and memory)
• Configuration issues
• Service availability (Temporal server and other dependent services)
• IAM roles and permissions

Actions Taken:
1. Verified network connectivity between the pods and the Temporal server.
2. Ensured sufficient resource allocation to the worker pods.
3. Checked the health and logs of dependent services (roughly as sketched below).
4. Reviewed IAM roles and permissions for the worker pods.
5. Upgraded to the latest Airbyte version.

Despite these efforts, the issue persists. We need assistance in diagnosing and resolving this problem to ensure stable and reliable data replication.

Environment Details:
• Kubernetes Cluster: EKS
• Instance Type: t2.xlarge
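For reference, this is roughly how the pod status and worker logs above can be collected. The namespace `default` matches the reflector line in the log excerpt; the `airbyte-worker` deployment name and the `airbyte` namespace assume a Helm release called "airbyte" and are illustrative.

```bash
# Pod status for the failing replication job
# (namespace "default" matches the reflector line in the log excerpt)
kubectl get pods -n default | grep orchestrator-repl-job-2568

# Events for the orchestrator pod that never reached Running
kubectl describe pod orchestrator-repl-job-2568-attempt-0 -n default

# Worker logs containing the WorkerException and the 404 responses
# (deployment and namespace names assume a Helm release called "airbyte")
kubectl logs deploy/airbyte-worker -n airbyte --tail=200
```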
Based on the information provided and the relevant knowledge sources, it appears that the issue you're experiencing with the Airbyte worker failing to run the `replication-orchestrator` on your EKS deployment is a known problem that has been reported by other users as well. Here are some steps and considerations to help diagnose and potentially resolve the issue:

### Steps to Diagnose and Resolve the Issue

1. Upgrade to the Latest Version: Ensure that you are running the latest version of Airbyte. There have been significant refactors and bug fixes in recent versions, and upgrading might resolve the issue (see the Helm sketch after this list).
   - [Forum Post: Upgrade to the latest version](https://discuss.airbyte.io/t/3823)
2. Check for Pre-existing Pods: The error message indicates that there might be pre-existing pods that didn't advance out of the `NOT_STARTED` state. Ensure that there are no lingering pods from previous attempts (see the kubectl sketch after this list).
   - [Forum Post: Pre-existing pod issue](https://discuss.airbyte.io/t/3823)
3. Review Helm Chart Versions: Users have reported stability issues with Helm chart versions later than 0.49.6. If you are using a newer Helm chart version, consider downgrading to 0.49.6 to see if it resolves the issue.
   - [Github Issue: Helm Chart Stability](https://github.com/airbytehq/airbyte/issues/32203)
4. Check Kubernetes Client Logs: Review the Kubernetes client logs for any issues related to pod creation or handling. There might be retries or failures in creating or replacing pods.
   - [Github Issue: Kubernetes Client Logs](https://github.com/airbytehq/airbyte/issues/18888)
5. Network and Resource Allocation: Ensure that there are no network connectivity issues between the pods and the Temporal server, and verify that the worker pods have sufficient CPU and memory allocated (see the resource/IAM sketch after this list).
   - [Forum Post: Network and Resource Checks](https://discuss.airbyte.io/t/3823)
6. IAM Roles and Permissions: Double-check the IAM roles and permissions assigned to the worker pods to ensure they have the necessary access to perform their tasks.
   - [Forum Post: IAM Roles and Permissions](https://discuss.airbyte.io/t/3823)
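For steps 1 and 3, upgrading the deployment or pinning the Helm chart version looks roughly like the sketch below. The release name `airbyte` and namespace `airbyte` are assumptions; substitute the values from your own install.

```bash
# Make sure the Airbyte Helm repo is registered and up to date
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update

# Step 1: upgrade the existing release to the latest chart
# (release name "airbyte" and namespace "airbyte" are assumptions)
helm upgrade airbyte airbyte/airbyte --namespace airbyte

# Step 3: alternatively, pin the chart to 0.49.6 if newer charts are unstable
helm upgrade airbyte airbyte/airbyte --namespace airbyte --version 0.49.6
```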
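For steps 2 and 4, a minimal kubectl sketch for spotting lingering orchestrator pods and scanning the worker's pod-handling logs. The pod names and the `default` job namespace are taken from the log excerpt; the worker deployment name and its namespace are assumptions.

```bash
# Step 2: list orchestrator pods left over from previous attempts
# (namespace "default" matches the reflector line in the logs)
kubectl get pods -n default | grep orchestrator-repl-job

# Delete any stuck pod so the next attempt can start cleanly
kubectl delete pod orchestrator-repl-job-2568-attempt-0 -n default

# Step 4: scan the worker logs for pod creation retries and the 404 responses
# (deployment name "airbyte-worker" assumes a Helm release called "airbyte")
kubectl logs deploy/airbyte-worker -n airbyte --since=1h | grep -iE "orchestrator|retry|404"
```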
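For steps 5 and 6, worker resources and the service account's IAM role can be checked along these lines. The `worker.resources.*` value keys and the `airbyte-admin` service-account name are assumptions about the chart; verify them against your chart version's values.yaml before applying.

```bash
# Step 5: raise worker resource requests/limits via chart values
# (value keys are assumptions; confirm against the chart's values.yaml)
helm upgrade airbyte airbyte/airbyte --namespace airbyte \
  --set worker.resources.requests.cpu=1 \
  --set worker.resources.requests.memory=2Gi \
  --set worker.resources.limits.memory=4Gi

# Step 6: confirm the worker's service account carries the expected IAM role (IRSA)
# (service-account name "airbyte-admin" is an assumption; list SAs if unsure)
kubectl get serviceaccount airbyte-admin -n airbyte -o yaml | grep eks.amazonaws.com/role-arn
```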
### Example Log Excerpt for Reference

```text
message='io.temporal.serviceclient.CheckedExceptionWrapper: io.airbyte.workers.exception.WorkerException: Running the launcher replication-orchestrator failed', type='java.lang.RuntimeException', nonRetryable=false
Received failed response: 404, Request ID: XXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX==
Creating orchestrator-repl-job-2568-attempt-1 for attempt number: 1
There are currently running pods for the connection: [orchestrator-repl-job-2568-attempt-0]. Killing these pods to enforce one execution at a time.
Attempting to delete pods: [orchestrator-repl-job-2568-attempt-0]
Waiting for deletion...
Successfully deleted all running pods for the connection!
Waiting for pod to be running...
Ready to run resync and reflector for v1/namespaces/default/pods with resync 0
Resync skipped due to 0 full resync period for v1/namespaces/default/pods
```
### Additional Resources

If the issue persists, consider reaching out to the Airbyte community on [Slack](https://slack.airbyte.com/) for further assistance.

By following these steps, you should be able to diagnose and potentially resolve the issue with the Airbyte worker failing to run the `replication-orchestrator` on your EKS deployment.