Sumit Nekar

05/25/2023, 10:55 AM
Hi, I am experimenting with the flink operator's feature for restarting unhealthy jobs, using the following configs.
# Restart of unhealthy job deployments by flink-kubernetes-operator
# Health check enabled, restart threshold 2, window 15 min
kubernetes.operator.cluster.health-check.enabled: true
kubernetes.operator.cluster.health-check.restarts.threshold: 2
kubernetes.operator.cluster.health-check.restarts.window: 15 min
kubernetes.operator.job.restart.failed: true
As per the above configs, if the job restarts more than 2 times within 15 min, the flink operator redeploys the job from the last available state. That means in application mode, a new JM pod comes up after the running job has restarted 2 times within 15 min. I need some clarifications on this:
1. In case the job is not able to recover at all, will the flink operator keep restarting it forever? Is there a threshold count after which the flink operator gives up?
2. How is the restart-strategy configured at the JM level honoured by the flink operator in this case?
3. The flink operator refers to the “flink_jobmanager_job_numRestarts” metric to decide if the restart count threshold is breached, but every time the operator redeploys the job a new JM comes up and flink_jobmanager_job_numRestarts starts from 0 again, so the flink operator continues to redeploy after every 2 restarts. In this case, when will the job be marked as failed?

Gyula Fóra

05/26/2023, 8:52 AM
The operator health check restarts are designed to catch unrecoverable errors that a full redeploy / restart may solve. This is completely independent from Flink's restart strategy.
Generally the restart strategy keeps restarting the job indefinitely. The operator may catch this and perform a full cluster restart.
So when the operator restarts the job based on this optional setting, the JM restart strategy's counter is reset.
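To make the separation concrete, here is a minimal sketch of how the two layers might sit together in one FlinkDeployment spec. The deployment name and the restart-strategy values are illustrative assumptions, not from the thread; the `kubernetes.operator.*` keys follow the flink-kubernetes-operator configuration reference.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job   # hypothetical name
spec:
  flinkConfiguration:
    # JM-level restart strategy: Flink itself retries the job on failures.
    # These values are illustrative.
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: "10"
    restart-strategy.fixed-delay.delay: "10 s"
    # Operator-level health check: if numRestarts exceeds the threshold
    # within the window, the operator redeploys the whole cluster from the
    # last available state, which also resets the JM restart counter.
    kubernetes.operator.job.restart.failed: "true"
    kubernetes.operator.cluster.health-check.enabled: "true"
    kubernetes.operator.cluster.health-check.restarts.threshold: "2"
    kubernetes.operator.cluster.health-check.restarts.window: "15 min"
```

The two mechanisms act at different levels: the restart strategy recovers tasks inside a running JM, while the health check replaces the JM itself, so their counters are independent.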