# troubleshooting
c
Hello all, I am using the Flink Kubernetes Operator with the FlinkDeployment CRD to spin up the JobManager and TaskManager pods. Everything worked fine for quite some time. Then the TaskManager pod suddenly restarted and stayed in the ContainerCreating state indefinitely, without giving any error that explained why it was stuck there. As a temporary workaround, I deleted the whole namespace, and after that it started working again. Any idea what the issue could be? Thanks in advance.
d
To troubleshoot further:
Run `kubectl describe pod <pod-name>` to get more details about the pod's status and events.
Check the logs on the node where the pod was scheduled, especially the kubelet logs. The kubelet is the node agent that actually creates the containers, so its logs often contain more specific error messages than the pod events.
Inspect the events in the namespace with `kubectl get events -n <namespace>`, or use `kubectl get events --all-namespaces` to see events across the whole cluster.
If you're using a managed Kubernetes service, check the cloud provider's console for any related alerts or issues. If the issue recurs, consider setting up more comprehensive logging and monitoring so you can catch and diagnose it early.
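The event check can also be scripted. Below is a minimal Python sketch, assuming `kubectl` is on the PATH and configured for the right cluster; `fetch_pod_events` and `warning_events` are hypothetical helper names. It uses a field selector on `involvedObject.name` to fetch only the stuck pod's events and keeps the `Warning` ones.

```python
import json
import subprocess


def fetch_pod_events(namespace: str, pod_name: str) -> dict:
    """Return the parsed `kubectl get events -o json` output for a single pod."""
    # Assumes kubectl is installed and pointed at the right cluster/context.
    result = subprocess.run(
        [
            "kubectl", "get", "events",
            "-n", namespace,
            "--field-selector", f"involvedObject.name={pod_name}",
            "-o", "json",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def warning_events(events: dict) -> list[str]:
    """Return "reason: message" for each Warning event in an events list."""
    return [
        f"{item.get('reason', '?')}: {item.get('message', '').strip()}"
        for item in events.get("items", [])
        if item.get("type") == "Warning"
    ]
```

For example, `print("\n".join(warning_events(fetch_pod_events("<namespace>", "<pod-name>"))))`. Keep in mind that the API server only retains events for a limited time (`--event-ttl`, one hour by default), so an empty result for a pod that has been stuck for a long time does not rule out an earlier failure.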
c
OK, let me check the kubelet logs, because the pod events don't show any errors.
Thanks for the suggestion.
d
Yes, I would look through the logs. There are a few things that could cause this; one is a lack of resources such as CPU or disk.
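To check the resource angle from the kubelet's point of view, the node's conditions report whether it is under memory, disk, or PID pressure. A minimal Python sketch, assuming `kubectl` access; `fetch_node` and `node_problems` are hypothetical helper names, and the node name can be read from the NODE column of `kubectl get pod <pod-name> -o wide`.

```python
import json
import subprocess

# Node conditions that signal the kubelet is under resource pressure.
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")


def fetch_node(node_name: str) -> dict:
    """Return the node object as parsed JSON via `kubectl get node -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "node", node_name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def node_problems(node: dict) -> list[str]:
    """List pressure conditions that are True, plus Ready if it is not True."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype in PRESSURE_CONDITIONS and status == "True":
            problems.append(f"{ctype}: {cond.get('message', '')}")
        elif ctype == "Ready" and status != "True":
            # "False" or "Unknown" both mean the node is not healthy.
            problems.append(f"NotReady: {cond.get('message', '')}")
    return problems
```

Plenty of free CPU does not exclude `DiskPressure`, which the kubelet evaluates separately from CPU and memory.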
In my experience, though, this is most often caused by an image pull issue.
That can come from either network connectivity problems or lack of access to the Docker image.
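One way to test the image-pull theory is to look at the kubelet's pull events for the pod (`Pulling`, `Pulled`, and `Failed`/`BackOff` with an image in the message). The sketch below is a hypothetical helper that takes the parsed output of `kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> -o json` and reports the latest pull state per image. It assumes the kubelet's event messages quote the image as `image "<name>"`, which matches the usual wording but is not a stable API.

```python
import re

# Pull-related kubelet event reasons mapped to a simple state.
STATE_BY_REASON = {"Pulling": "pulling", "Pulled": "pulled", "Failed": "failed", "BackOff": "failed"}
# Assumption: messages quote the image, e.g. 'Pulling image "repo/flink:tag"'.
IMAGE_RE = re.compile(r'image "([^"]+)"')


def image_pull_states(events: dict) -> dict[str, str]:
    """Map each image mentioned in a pod's events to the state of its latest pull event."""

    def timestamp(item: dict) -> str:
        # RFC 3339 UTC timestamps in the same format sort chronologically as strings.
        return item.get("lastTimestamp") or item.get("firstTimestamp") or ""

    states: dict[str, str] = {}
    for item in sorted(events.get("items", []), key=timestamp):
        state = STATE_BY_REASON.get(item.get("reason", ""))
        match = IMAGE_RE.search(item.get("message", ""))
        # "Failed"/"BackOff" are also used for non-image problems; the regex filters those out.
        if state and match:
            states[match.group(1)] = state  # later events overwrite earlier ones
    return states
```

An image that stays in `pulling` points toward the registry or the network path from that node rather than toward the pod spec.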
c
I checked the node's resources, and there are plenty available. As for image pulls, other application pods that pull from the same source are running fine, and their images are larger than the Flink images.
I have asked for the kubelet logs to check further.
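Once the kubelet logs are available (on systemd-based nodes, typically via `journalctl -u kubelet --since "1 hour ago"`), they can be narrowed down to the stuck pod. A minimal Python sketch; `kubelet_errors_for_pod` is a hypothetical helper, and it assumes the kubelet logs in the standard klog format, where each entry starts with a severity letter (`I`, `W`, `E`, `F`) followed by the date and time, e.g. `E0612 10:00:01.000002`.

```python
import re
from collections.abc import Iterable

# klog header: severity letter followed by MMDD and a microsecond timestamp.
KLOG_SEVERITY_RE = re.compile(r"\b([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+")
SEVERITY_ORDER = "IWEF"  # Info < Warning < Error < Fatal


def kubelet_errors_for_pod(log_lines: Iterable[str], pod_name: str, min_severity: str = "W") -> list[str]:
    """Return kubelet log lines that mention pod_name at or above min_severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    hits = []
    for line in log_lines:
        if pod_name not in line:  # plain substring match on the pod name
            continue
        match = KLOG_SEVERITY_RE.search(line)
        if match and SEVERITY_ORDER.index(match.group(1)) >= threshold:
            hits.append(line.rstrip("\n"))
    return hits
```

For example, `kubelet_errors_for_pod(open("kubelet.log"), "<pod-name>")` on a saved journal export. Because the pod match is a substring test, `tm-1` would also match `tm-10`, so pass the full pod name.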