<@U04S8PLJWMT> Running OSS 1.4.0 syncs are failing...
# ask-ai
@kapa.ai Running OSS 1.4.0 syncs are failing, and the following logs keep repeating while attempted jobs stall out:
```
05:02:05.134 Starting pod sweeper cycle:
05:02:05.137 Will sweep succeeded pods from before 2025-02-10T04:52:05,135555682+00:00
05:02:05.140 Will sweep unsuccessful pods from before 2025-02-10T03:02:05,138479147+00:00
05:02:05.150 10.50.26.30 - - [10/Feb/2025:05:02:05 +0000] "GET /index.html HTTP/1.1" 200 25279 "-" "kube-probe/1.30+" "-"
05:02:05.232 Completed pod sweeper cycle. Sleeping for 60 seconds...
2025-02-10 05:02:06,495 [scheduled-executor-thread-1] INFO i.a.c.j.WorkloadMonitor(cancelRunningForTooLongNonSyncWorkloads):137 - Checking for workloads running for too long with timeout value PT4H
2025-02-10 05:02:06,495 [scheduled-executor-thread-13] INFO i.a.c.j.WorkloadMonitor(cancelRunningForTooLongSyncWorkloads):157 - Checking for sync workloads running for too long with timeout value PT720H
2025-02-10 05:02:06,495 [scheduled-executor-thread-9] INFO i.a.c.j.WorkloadMonitor(cancelNotHeartbeatingWorkloads):110 - Checking for non heartbeating workloads.
2025-02-10 05:02:06,495 [scheduled-executor-thread-16] INFO i.a.c.j.WorkloadMonitor(cancelNotClaimedWorkloads):84 - Checking for not claimed workloads.
2025-02-10 05:02:06,495 [scheduled-executor-thread-8] INFO i.a.c.j.WorkloadMonitor(cancelNotStartedWorkloads):59 - Checking for not started workloads.
```
Based on the logs you've provided, it appears that your Airbyte OSS 1.4.0 syncs are not actually failing. The logs you've shared show normal operation of Airbyte's pod sweeper and workload monitoring processes. These are routine background tasks that Airbyte performs to manage its resources and ensure smooth operation.

However, if your syncs are indeed failing, the logs you've provided don't contain information about those failures. To troubleshoot sync failures, you would need to look at the logs for the specific failed sync jobs. I recommend the following steps:

1. Check the logs for the specific sync jobs that are failing. You can find these in the Airbyte UI under the "Job History" tab for each connection.
2. Look for error messages or stack traces in these logs that might indicate why the syncs are failing.
3. If you're running Airbyte on Kubernetes, you might also want to check the logs of the pods running the sync jobs (see the example commands below).

For more detailed information on how to access and interpret Airbyte logs, you can refer to the [Airbyte documentation on browsing output logs](https://docs.airbyte.com/operator-guides/browsing-output-logs). If you're still having trouble after checking the specific sync job logs, you might want to open an issue on the [Airbyte GitHub repository](https://github.com/airbytehq/airbyte) with more details about the failures you're experiencing.
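As a rough illustration of step 3, something like the commands below can help locate and inspect the job pods. The `airbyte` namespace and the `airbyte-workload-launcher` deployment name are assumptions based on a default Helm install; adjust them to match your deployment.

```bash
# Assumes Airbyte is deployed into the "airbyte" namespace -- adjust to your setup.
# List recent pods; sync job pods are usually created per connection/attempt.
kubectl get pods -n airbyte --sort-by=.metadata.creationTimestamp

# Inspect a specific job pod (replace <job-pod-name> with a pod from the list above).
kubectl describe pod <job-pod-name> -n airbyte      # look at scheduling and init-container events
kubectl logs <job-pod-name> -n airbyte --all-containers --timestamps

# The component that launches job pods (name is an assumption, check your release)
# often logs why a job was never claimed or started.
kubectl logs deploy/airbyte-workload-launcher -n airbyte --tail=200
```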
@kapa.ai when the sync job is kicked off, the pod sweeping cycles start but they run continuously and the rest of the airbyte sync job does not start. Is that supposed to happen? The errors I see in Airbyte relate to timeout logic
A message from kapa.ai
Based on the information you've provided, it seems that the behavior you're experiencing is not the expected operation of Airbyte. The pod sweeping cycles are a normal background process, but they shouldn't prevent sync jobs from starting or cause continuous timeouts. Let's explore this issue further based on the available knowledge sources.

The pod sweeper is designed to clean up completed or failed pods periodically, as mentioned in [GitHub issue #16589](https://github.com/airbytehq/airbyte/issues/16589). However, it shouldn't interfere with the execution of new sync jobs.

Regarding the timeout errors you're seeing, there are a few potential causes and solutions to consider:

1. Init container timeout: There's a known issue where the init container in job pods can time out, especially when new nodes are being provisioned. This was discussed in [GitHub issue #22907](https://github.com/airbytehq/airbyte/issues/22907). The default timeout is 1 minute, which might not be sufficient in some environments. You could try increasing `INIT_RETRY_TIMEOUT_MINUTES` if this is the case.
2. Resource constraints: If your Kubernetes cluster is under-resourced, it might cause delays in pod creation and initialization. This could lead to timeouts. Ensure that your cluster has sufficient resources to handle the workload.
3. Pod sweeper frequency: While the pod sweeper shouldn't prevent jobs from starting, if it's running too frequently, it might be consuming resources needed for job execution. You mentioned that the sweeping cycles are running continuously, which is not the intended behavior. The pod sweeper is typically configured to run every 2 hours, as mentioned in [GitHub issue #38187](https://github.com/airbytehq/airbyte/issues/38187).

To address these issues, you could try the following:

1. Check your Airbyte configuration, particularly the pod sweeper settings. Ensure it's not set to run too frequently.
2. Increase `INIT_RETRY_TIMEOUT_MINUTES` as suggested in [GitHub issue #22907](https://github.com/airbytehq/airbyte/issues/22907) (see the sketch below).
3. Review your Kubernetes cluster resources and scaling settings to ensure they can handle the Airbyte workload.
4. Check the full logs of a failing sync job to see if there are more specific error messages that could point to the root cause.

If these steps don't resolve the issue, you may want to open a new issue on the [Airbyte GitHub repository](https://github.com/airbytehq/airbyte) with detailed logs and your configuration settings. The continuous running of pod sweeping cycles and failure of sync jobs to start is not normal behavior and may require further investigation by the Airbyte team.
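As a rough sketch of what raising that timeout could look like on a Helm-based install: the exact values key varies by chart version, so treat `worker.extraEnv`, the release/namespace names, and the `airbyte-worker` deployment name as assumptions and confirm them against your chart's values.yaml before applying.

```bash
# Hypothetical example: raise the init-container retry timeout from the default (1 minute) to 5 minutes.
# "airbyte" (release and namespace) and the "worker.extraEnv" key are assumptions --
# check your chart version for the component that actually launches job pods.
helm upgrade airbyte airbyte/airbyte -n airbyte \
  --reuse-values \
  --set-string 'worker.extraEnv[0].name=INIT_RETRY_TIMEOUT_MINUTES' \
  --set-string 'worker.extraEnv[0].value=5'

# Confirm the variable landed on the deployment (deployment name is an assumption):
kubectl -n airbyte get deploy airbyte-worker \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```

If your chart version exposes environment variables through a different key, set the variable there instead; the important part is that the component launching the job pods sees it.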