# troubleshooting
r
Hi, I have a question regarding the Kubernetes operator. I have added a liveness probe in the jobmanager and taskmanager pod templates, but even when the liveness probe fails, the pods don't restart. In the jobmanager pod events I can see:
`Warning  Unhealthy       51s (x139 over 35m)    kubelet  Liveness probe errored: rpc error: code = Unknown desc = deadline exceeded ("DeadlineExceeded"): context deadline exceeded`
Other pods with the same liveness probe restart when they fail. How can I make this work?
g
Can you please share your podtemplate?
r
```yaml
jobManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        labels: {}
      spec:
        containers:
        - env:
          - name: KAFKA_ADDRESS
            value: 192.168.1.6:9096
          image: image
          livenessProbe:
            exec:
              command:
              - java
              - -cp
              - /opt/health-status-assembly-0.1-SNAPSHOT.jar:/opt/flink/lib/*
              - CheckHealth
            failureThreshold: 1
            initialDelaySeconds: 300
            periodSeconds: 15
            timeoutSeconds: 5
          name: flink-main-container
        initContainers:
        - args:
          - tar
          - -xvf
          - /cm/hadoop-conf/hadoop-conf.tar
          - -C
          - /opt/hadoop/conf
          image: ubuntu
          name: hadoop-conf
          volumeMounts:
          - mountPath: /cm/hadoop-conf
            name: hadoop-conf-cm
          - mountPath: /opt/hadoop/conf
            name: hadoop-conf
        volumes:
        - configMap:
            name: hadoop-conf
          name: hadoop-conf-cm
        - emptyDir: {}
          name: hadoop-conf
```
Thanks for the reply @Gyula Fóra
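For context, the event above says the probe *errored* with `DeadlineExceeded` rather than failing with a non-zero exit code, and the exec probe starts a JVM with `timeoutSeconds: 5`. One possible explanation, not confirmed in this thread, is that `java -cp ... CheckHealth` simply takes longer than 5 seconds, so kubelet cuts the exec off with a timeout error; how timed-out exec probes are counted also changed around the Kubernetes 1.20 `ExecProbeTimeout` feature gate, so behaviour can differ between clusters. A sketch of the same probe with a more generous timeout, where the new value is an illustrative assumption rather than something from the thread:

```yaml
# Sketch only: same probe, but with a larger exec timeout on the assumption
# that starting a JVM and probing the filesystem can take longer than 5s.
livenessProbe:
  exec:
    command:
    - java
    - -cp
    - /opt/health-status-assembly-0.1-SNAPSHOT.jar:/opt/flink/lib/*
    - CheckHealth
  failureThreshold: 1
  initialDelaySeconds: 300
  periodSeconds: 15
  timeoutSeconds: 30   # was 5; must cover JVM startup plus the health check itself
```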
g
Hm, I am not sure why it is not restarted. By the way, why are you trying to add a liveness probe? Flink already has mechanisms to detect and shut down lost TMs through heartbeats.
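For reference, the heartbeat-based detection mentioned here is tunable through Flink configuration; a minimal sketch of where those settings would live in a FlinkDeployment (the values shown are Flink's documented defaults, included only for illustration):

```yaml
spec:
  flinkConfiguration:
    # How often heartbeats are sent, and how long before a JM/TM peer is
    # considered lost. These are Flink's documented defaults.
    heartbeat.interval: "10000"
    heartbeat.timeout: "50000"
```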
r
I need to add a liveness probe because the HDFS filesystem we are using has some issues. Its client has no timeout, so once the connection to the filesystem expires, the client just keeps trying to reconnect indefinitely. When this happens I can't submit jobs to the session cluster. I think the issue doesn't occur with jobs that are already running in application clusters. In the health checker code I just try to connect to the filesystem with a timeout, so that once the connection fails the pod restarts and creates a new connection to the filesystem.
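A minimal sketch of what a health checker like the `CheckHealth` class above might look like, assuming it uses the Hadoop `FileSystem` API and a hard deadline; this is a hypothetical reconstruction, not the actual code from the thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical health checker: tries to reach the filesystem root with a
// hard deadline, since the filesystem client itself never times out.
public class CheckHealth {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> check = executor.submit(() -> {
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.exists(new Path("/"));
        });
        int exitCode;
        try {
            // Healthy only if the call returns within the deadline (assumed 10s).
            exitCode = check.get(10, TimeUnit.SECONDS) ? 0 : 1;
        } catch (Exception e) {
            // Timeout or connection failure: report unhealthy so the probe fails.
            exitCode = 1;
        }
        executor.shutdownNow();
        System.exit(exitCode);
    }
}
```

Whatever the real implementation looks like, the probe's `timeoutSeconds` has to be larger than this internal deadline plus JVM startup time, otherwise kubelet will cut the exec off with the `DeadlineExceeded` error seen above.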
g
understood, makes sense
unfortunately I don’t know the cause of your restart problem
r
Okay. Actually, when I added the liveness probe yesterday, k8s restarted the pods of the session cluster. But today, for some reason, that is not happening even though the probe has failed more than 100 times.