# troubleshooting
s
Hello, we have deployed a Flink job in application mode using the Flink operator. Sometimes we see TM pods get terminated (ERROR state) after multiple job restarts. I suspect the TM pods get terminated once the job fails. Can anyone help me understand what action the Flink operator takes when the job fails or the restart strategy threshold is reached?
g
The operator can also restart terminally failed jobs if configured to do so. I don't have the exact name in front of me, but it's something like kubernetes.operator.job.restart-failed.
s
Yes, it's kubernetes.operator.job.restart.failed: true. I mentioned it in my second question, the one you replied to. The task managers go into ERROR state, and the logs at that moment say leadership was lost in the JM. The Flink operator logs are below.
2023-05-26 05:23:24,039 o.a.f.k.o.l.AuditUtils         [INFO ][flink-pipeline-example/pipeline-event-dedup] >>> Event  | Warning | CLUSTERDEPLOYMENTEXCEPTION | Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
2023-05-26 05:23:24,039 o.a.f.k.o.r.ReconciliationUtils [WARN ][flink-pipeline-example/pipeline-event-dedup] Attempt count: 0, last attempt: false
2023-05-26 05:23:24,089 o.a.f.k.o.l.AuditUtils         [INFO ][flink-pipeline-example/pipeline-event-dedup] >>> Status | Error   | STABLE          | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.","throwableList":[{"type":"io.fabric8.kubernetes.client.KubernetesClientException","message":"Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed."},{"type":"java.io.IOException","message":"Read timed out"}]}
2023-05-26 05:23:24,089 i.j.o.p.e.ReconciliationDispatcher [ERROR][flink-pipeline-example/pipeline-event-dedup] Error during event processing ExecutionScope{ resource id: ResourceID{name='pipeline-event-dedup', namespace='flink-pipeline-example'}, version: 88813125} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
2023-05-26 05:23:29,090 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-pipeline-example/pipeline-event-dedup] Starting reconciliation
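
The Status line in the log above carries the exception chain as a JSON payload, which is hard to read inline. A minimal Python sketch for pulling out its throwableList follows. The helper name exception_chain and the shortened SAMPLE line are illustrative assumptions, not part of the operator, and the sketch assumes the JSON object starts at the first "{" on the line, as it does in the pasted log.

```python
import json

# Abbreviated copy of the operator "Status" line from the log above
# (the long KubernetesClientException message is shortened here).
SAMPLE = (
    '2023-05-26 05:23:24,089 o.a.f.k.o.l.AuditUtils [INFO ]'
    '[flink-pipeline-example/pipeline-event-dedup] >>> Status | Error | STABLE | '
    '{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException",'
    '"message":"Operation: [get] for kind: [Event] failed.",'
    '"throwableList":['
    '{"type":"io.fabric8.kubernetes.client.KubernetesClientException",'
    '"message":"Operation: [get] for kind: [Event] failed."},'
    '{"type":"java.io.IOException","message":"Read timed out"}]}'
)


def exception_chain(status_line):
    """Return (type, message) pairs from the JSON payload of a Status line."""
    # Assumption: the JSON object begins at the first "{" on the line.
    payload = json.loads(status_line[status_line.index("{"):])
    return [(t["type"], t["message"]) for t in payload.get("throwableList", [])]


if __name__ == "__main__":
    for exc_type, message in exception_chain(SAMPLE):
        print(f"{exc_type}: {message}")
```

On this line the last entry in the chain is the java.io.IOException with message "Read timed out", so the reconciliation error looks like a timed-out call from the operator to the Kubernetes API. That alone does not show why the TM pods were terminated.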