# troubleshooting
r
Hi, I have a question regarding the Kubernetes operator. I have added a liveness probe in the jobmanager and taskmanager pod templates, but even when the liveness probe fails, the pods don't restart. In the jobmanager pod events I can see:
`Warning  Unhealthy       51s (x139 over 35m)    kubelet  Liveness probe errored: rpc error: code = Unknown desc = deadline exceeded ("DeadlineExceeded"): context deadline exceeded`
Other pods with the same liveness probe restart when they fail. How can I make this work?
g
Can you please share your podtemplate?
r
```yaml
jobManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        labels: {}
      spec:
        containers:
        - env:
          - name: KAFKA_ADDRESS
            value: 192.168.1.6:9096
          image: image
          livenessProbe:
            exec:
              command:
              - java
              - -cp
              - /opt/health-status-assembly-0.1-SNAPSHOT.jar:/opt/flink/lib/*
              - CheckHealth
            failureThreshold: 1
            initialDelaySeconds: 300
            periodSeconds: 15
            timeoutSeconds: 5
          name: flink-main-container
        initContainers:
        - args:
          - tar
          - -xvf
          - /cm/hadoop-conf/hadoop-conf.tar
          - -C
          - /opt/hadoop/conf
          image: ubuntu
          name: hadoop-conf
          volumeMounts:
          - mountPath: /cm/hadoop-conf
            name: hadoop-conf-cm
          - mountPath: /opt/hadoop/conf
            name: hadoop-conf
        volumes:
        - configMap:
            name: hadoop-conf
          name: hadoop-conf-cm
        - emptyDir: {}
          name: hadoop-conf
```
Thanks for the reply @Gyula Fóra
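For context, the event above says the probe *errored* with `DeadlineExceeded` rather than failing with a non-zero exit code, and the exec probe starts a JVM with `timeoutSeconds: 5`. One possible explanation, not confirmed in this thread, is that `java -cp ... CheckHealth` simply takes longer than 5 seconds, so kubelet cuts the exec off with a timeout error; how timed-out exec probes are counted also changed around the Kubernetes 1.20 `ExecProbeTimeout` feature gate, so behaviour can differ between clusters. A sketch of the same probe with a more generous timeout, where the new value is an illustrative assumption rather than something from the thread:

```yaml
# Sketch only: same probe, but with a larger exec timeout on the assumption
# that starting a JVM and probing the filesystem can take longer than 5s.
livenessProbe:
  exec:
    command:
    - java
    - -cp
    - /opt/health-status-assembly-0.1-SNAPSHOT.jar:/opt/flink/lib/*
    - CheckHealth
  failureThreshold: 1
  initialDelaySeconds: 300
  periodSeconds: 15
  timeoutSeconds: 30   # was 5; must cover JVM startup plus the health check itself
```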
g
Hm, I am not sure why it is not restarted. By the way, why are you trying to add a liveness probe? Flink already has mechanisms to detect and shut down lost TMs through heartbeats.
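For reference, the heartbeat-based detection mentioned here is tunable through Flink configuration; a minimal sketch of where those settings would live in a FlinkDeployment (the values shown are Flink's documented defaults, included only for illustration):

```yaml
spec:
  flinkConfiguration:
    # How often heartbeats are sent, and how long before a JM/TM peer is
    # considered lost. These are Flink's documented defaults.
    heartbeat.interval: "10000"
    heartbeat.timeout: "50000"
```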
r
I need to add a liveness probe because the HDFS filesystem we are using has some issues. Its client has no timeout, so once the connection to the filesystem expires, the client just keeps trying to reconnect indefinitely. When this happens I can't submit jobs to the session cluster. I think the issue doesn't occur with jobs that are already running in application clusters. In the health checker code I just try to connect to the filesystem with a timeout, so that once the connection fails the pod restarts and creates a new connection to the filesystem.
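A minimal sketch of what a health checker like the `CheckHealth` class above might look like, assuming it uses the Hadoop `FileSystem` API and a hard deadline; this is a hypothetical reconstruction, not the actual code from the thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical health checker: tries to reach the filesystem root with a
// hard deadline, since the filesystem client itself never times out.
public class CheckHealth {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> check = executor.submit(() -> {
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.exists(new Path("/"));
        });
        int exitCode;
        try {
            // Healthy only if the call returns within the deadline (assumed 10s).
            exitCode = check.get(10, TimeUnit.SECONDS) ? 0 : 1;
        } catch (Exception e) {
            // Timeout or connection failure: report unhealthy so the probe fails.
            exitCode = 1;
        }
        executor.shutdownNow();
        System.exit(exitCode);
    }
}
```

Whatever the real implementation looks like, the probe's `timeoutSeconds` has to be larger than this internal deadline plus JVM startup time, otherwise kubelet will cut the exec off with the `DeadlineExceeded` error seen above.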
g
understood, makes sense
unfortunately I don’t know the cause of your restart problem
r
Okay. Actually, when I added the liveness probe yesterday, k8s restarted the pods of the session cluster. But today, for some reason, that is not happening even though the probe has failed more than 100 times.