Sumit Nekar
05/25/2023, 10:55 AM
# Restart of unhealthy job deployments by flink-kubernetes-operator
# <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.3/docs/custom-resource/job-management/#restart-of-unhealthy-job-deployments>
kubernetes.operator.cluster.health-check.enabled: true
kubernetes.operator.cluster.health-check.restarts.threshold: 2
kubernetes.operator.cluster.health-check.restarts.window: 15 min
kubernetes.operator.job.restart.failed: true
Per the above configs, if the job restarts more than 2 times within 15 minutes, the flink operator redeploys the job from the latest available state. That means in application mode, a new JM pod comes up after the running job has restarted 2 times within the 15-minute window. I need some clarification on this:
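To make the described behaviour concrete, here is a minimal sketch (not the operator's actual implementation) of the window-based check the configs imply: redeploy when the number of restarts inside the sliding window exceeds the threshold.

```python
from datetime import datetime, timedelta

# Values mirror the configs above; names are illustrative, not operator internals.
RESTART_THRESHOLD = 2
WINDOW = timedelta(minutes=15)

def should_redeploy(restart_times: list[datetime], now: datetime) -> bool:
    """Return True if more than RESTART_THRESHOLD restarts fell within WINDOW."""
    recent = [t for t in restart_times if now - t <= WINDOW]
    return len(recent) > RESTART_THRESHOLD
```

With 3 restarts in the last 15 minutes this returns True (3 > 2), with 2 it returns False.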
1. If the job is not able to recover at all, will the flink operator keep restarting it forever? Is there a threshold count after which the flink operator gives up?
2. How is the restart-strategy configured at the JM level honoured by the flink operator in this case?
3. The flink operator refers to the `flink_jobmanager_job_numRestarts` metric to decide whether the restartCount threshold is breached, but every time the operator redeploys the job, a new JM comes up and `flink_jobmanager_job_numRestarts` starts from 0 again, so the operator keeps redeploying after every 2 restarts. In this case, when will the job be marked as failed?
Gyula Fóra
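For context on question 2: the JM-level restart strategy is set through standard Flink configuration, e.g. a fixed-delay strategy (the values here are illustrative, not from the thread):

```yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

Flink's restart strategy governs task restarts inside a running JM, while the operator-level health check acts on top of it by redeploying the whole deployment.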
05/26/2023, 8:52 AM