Hello We have deployed a flink job in application mode using Apache Flink #troubleshooting

Sumit Nekar

05/25/2023, 7:03 AM

Hello, we have deployed a Flink job in application mode using the Flink operator. Sometimes we see TM pods get terminated (error state) after multiple job restarts. I suspect the TM pods are terminated once the job fails. Can anyone help me understand what action the Flink operator takes when the job fails or the restart strategy threshold is reached?

Gyula Fóra

05/26/2023, 8:58 AM

The operator can also restart terminally failed jobs if configured to do so. I don't have it in front of me, but it's something like kubernetes.operator.job.restart-failed

Sumit Nekar

05/26/2023, 6:15 PM

Yes, it's kubernetes.operator.job.restart.failed: true. I mentioned it in my second question, the one you replied to. The task managers are going into the ERROR state, and the logs at that moment say leadership was lost in the JM. Flink operator logs below.

2023-05-26 05:23:24,039 o.a.f.k.o.l.AuditUtils         [INFO ][flink-pipeline-example/pipeline-event-dedup] >>> Event  | Warning | CLUSTERDEPLOYMENTEXCEPTION | Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
2023-05-26 05:23:24,039 o.a.f.k.o.r.ReconciliationUtils [WARN ][flink-pipeline-example/pipeline-event-dedup] Attempt count: 0, last attempt: false
2023-05-26 05:23:24,089 o.a.f.k.o.l.AuditUtils         [INFO ][flink-pipeline-example/pipeline-event-dedup] >>> Status | Error   | STABLE          | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.","throwableList":[{"type":"io.fabric8.kubernetes.client.KubernetesClientException","message":"Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed."},{"type":"java.io.IOException","message":"Read timed out"}]}
2023-05-26 05:23:24,089 i.j.o.p.e.ReconciliationDispatcher [ERROR][flink-pipeline-example/pipeline-event-dedup] Error during event processing ExecutionScope{ resource id: ResourceID{name='pipeline-event-dedup', namespace='flink-pipeline-example'}, version: 88813125} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Event]  with name: [JobManagerDeployment.1416348292]  in namespace: [flink-pipeline-example]  failed.
2023-05-26 05:23:29,090 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-pipeline-example/pipeline-event-dedup] Starting reconciliation
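
For reference, a minimal sketch of where the kubernetes.operator.job.restart.failed setting confirmed in the thread might go in a FlinkDeployment resource. The name and namespace are taken from the operator logs above. The restart-strategy keys and their values are illustrative assumptions, not settings anyone in the thread reported, and the rest of the spec (image, job, resources) is omitted, so this fragment is not deployable on its own.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: pipeline-event-dedup          # from the operator logs above
  namespace: flink-pipeline-example   # from the operator logs above
spec:
  flinkConfiguration:
    # Ask the operator to restart the job after it reaches a terminal FAILED state
    kubernetes.operator.job.restart.failed: "true"
    # Assumed, illustrative restart strategy: once these attempts are exhausted,
    # the job fails terminally and the operator setting above takes over
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: "3"
    restart-strategy.fixed-delay.delay: 10s
  # image, serviceAccount, jobManager, taskManager and job sections omitted
```

Values under flinkConfiguration are strings, so the boolean and the attempt count are quoted.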