# troubleshooting
Hi all, I'm having issues with checkpoints on k8s with the Flink operator. I'm currently using Flink operator 1.7 with Flink 1.18.1. The issue is the following, and I haven't found the cause yet: sometimes my JobManager is killed and restarted and a new pod is created, but the checkpoint is not restored, so the application starts with a clean state, which is not good, even though the configuration is set to last-state. When I check the logs I see "Found 0 checkpoints":
2024-07-25 07:29:54,586 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Initializing job 'pot' (bfadda0ae94f23951224bb498106d4cf).
2024-07-25 07:29:54,604 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for pot (bfadda0ae94f23951224bb498106d4cf).
2024-07-25 07:29:54,785 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='pot-flink-deployment-bfadda0ae94f23951224bb498106d4cf-config-map'}.
2024-07-25 07:29:54,793 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='pot-flink-deployment-bfadda0ae94f23951224bb498106d4cf-config-map'}.
2024-07-25 07:29:54,793 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Trying to fetch 0 checkpoints from storage.
2024-07-25 07:29:54,892 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapSharedInformer [] - Starting to watch for pot/
And the fun part is that sometimes it works: a few times it finds the checkpoint and restores correctly. If I check the ConfigMap
pot-flink-deployment-bfadda0ae94f23951224bb498106d4cf-config-map
the checkpoint information is correct there, and the folder is correctly stored in the GCP bucket. I hope someone can guide me to solve this problem.
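In case it helps anyone reproduce the ConfigMap check, here is a rough sketch of how it could be scripted. It assumes the Kubernetes HA services keep one data key per completed checkpoint with the prefix `checkpointID-`; that prefix and the helper name `checkpoint_ids` are assumptions for illustration, not something taken from the logs above.

```python
import json
from typing import List

# Assumption: each completed checkpoint is stored under a ConfigMap data key
# that starts with this prefix, followed by the zero-padded checkpoint ID.
CHECKPOINT_KEY_PREFIX = "checkpointID-"


def checkpoint_ids(configmap_json: str) -> List[int]:
    """Return the checkpoint IDs found in a ConfigMap JSON dump, sorted ascending.

    The input is expected to be the output of:
        kubectl get configmap <name> -n <namespace> -o json
    """
    # A ConfigMap without entries may have no "data" field at all.
    data = json.loads(configmap_json).get("data") or {}
    ids = []
    for key in data:
        if key.startswith(CHECKPOINT_KEY_PREFIX):
            # Strip the prefix and parse the zero-padded number.
            ids.append(int(key[len(CHECKPOINT_KEY_PREFIX):]))
    return sorted(ids)
```

If this returns a non-empty list right before the JobManager restarts while the new pod still logs "Found 0 checkpoints", that would suggest the entries disappear or go unread during the restart, rather than never being written.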