Sumit Nekar

05/25/2023, 10:55 AM
Hi, I am experimenting with the flink operator's feature for restarting unhealthy jobs, using the following configs.
# Restart of unhealthy job deployments by flink-kubernetes-operator
# Health check enabled, restart threshold 2, window 15 min
kubernetes.operator.cluster.health-check.enabled: true
kubernetes.operator.cluster.health-check.restarts.threshold: 2
kubernetes.operator.cluster.health-check.restarts.window: 15 min
kubernetes.operator.job.restart.failed: true
As per the above configs, if the job restarts more than 2 times within 15 min, the flink operator redeploys the job from the last available state. That means in application mode, a new JM pod comes up after the running job has restarted 2 times within 15 min. I need some clarifications on this:
1. In case the job is not able to recover at all, will the flink operator keep restarting it forever? Is there a threshold count after which the flink operator gives up?
2. How is the restart-strategy configured at the JM level honoured by the flink operator in this case?
3. The flink operator refers to the “flink_jobmanager_job_numRestarts” metric to decide if the restart count threshold is breached, but every time the operator redeploys the job a new JM comes up and flink_jobmanager_job_numRestarts starts from 0 again, so the flink operator continues to redeploy after every 2 restarts. In this case, when will the job be marked as failed?

Gyula Fóra

05/26/2023, 8:52 AM
The operator health check restarts are designed to catch unrecoverable errors that a full redeploy / restart may solve. This is completely independent from Flink's restart strategy.
Generally the restart strategy keeps restarting the job indefinitely. The operator may catch this and perform a full cluster restart.
So when the operator restarts the job based on this optional setting, the JM restart strategy's counter is reset.
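To make the separation concrete, here is a minimal sketch of how the two layers might sit together in one FlinkDeployment spec. The deployment name and the restart-strategy values are illustrative assumptions, not from the thread; the `kubernetes.operator.*` keys follow the flink-kubernetes-operator configuration reference.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job   # hypothetical name
spec:
  flinkConfiguration:
    # JM-level restart strategy: Flink itself retries the job on failures.
    # These values are illustrative.
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: "10"
    restart-strategy.fixed-delay.delay: "10 s"
    # Operator-level health check: if numRestarts exceeds the threshold
    # within the window, the operator redeploys the whole cluster from the
    # last available state, which also resets the JM restart counter.
    kubernetes.operator.job.restart.failed: "true"
    kubernetes.operator.cluster.health-check.enabled: "true"
    kubernetes.operator.cluster.health-check.restarts.threshold: "2"
    kubernetes.operator.cluster.health-check.restarts.window: "15 min"
```

The two mechanisms act at different levels: the restart strategy recovers tasks inside a running JM, while the health check replaces the JM itself, so their counters are independent.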