# troubleshooting
Nicholas:
We want to do the following and simply can't find any standard process for doing so:
• Migrate to a new Kubernetes cluster
• Use the last checkpoint (Kafka source) to continue from when creating the new job, with no duplicates

This seems very straightforward, and it also appears to be provisioned for in the documentation (connectors/datastream/kafka/#starting-offset). However, the following strategies don't work:
• Suspending the job removes the contents of the `high-availability` directory (which is where the reference to the last checkpoint is stored)
• `kubectl delete flinkdeployment` also deletes the contents of the `high-availability` directory
• `kubectl delete deployment` leads to the Flink operator recreating the job/task pods, which means they run in both Kubernetes clusters and produce duplicates

How is one supposed to do this? It seems like an extremely common requirement. There is a very laborious workaround: suspend with a savepoint, copy the last savepoint, and redeploy the new job with a reference to that savepoint (see the sketch after this message). This surely can't be the standard way of doing it. Any help/advice would be appreciated.
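For reference, a minimal sketch of the laborious savepoint-based path described above, using the Flink Kubernetes Operator's FlinkDeployment resource. The resource name, image, jar path, and bucket paths are placeholders, and the exact spec may need adjusting to your operator version.

```yaml
# Step 1 (old cluster): suspend the job; with upgradeMode: savepoint the operator
# takes a savepoint before stopping it, e.g.:
#   kubectl patch flinkdeployment my-job --type merge \
#     -p '{"spec":{"job":{"state":"suspended","upgradeMode":"savepoint"}}}'
#
# Step 2: note the resulting savepoint path (and copy it if the new cluster
# uses different storage).
#
# Step 3 (new cluster): create the deployment pointing at that savepoint.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job                                    # placeholder
spec:
  image: my-registry/my-flink-job:latest          # placeholder
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: s3://flink/recovery
    state.savepoints.dir: s3://flink/savepoints   # assumed savepoint location
  jobManager:
    resource: { memory: "2048m", cpu: 1 }
  taskManager:
    resource: { memory: "2048m", cpu: 1 }
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar  # placeholder
    parallelism: 2
    upgradeMode: savepoint
    state: running
    # Restore from the copied savepoint so the Kafka source resumes from the
    # offsets stored in it rather than from the configured starting offset.
    initialSavepointPath: s3://flink/savepoints/savepoint-xxxx   # placeholder path
```

With `upgradeMode: savepoint`, later spec changes in the new cluster should again go through savepoints rather than relying on the HA metadata.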
Flaviu Cicio:
Hello Nicholas, sounds like a configuration issue. Have you followed the Kubernetes HA guide? https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/ha/kubernetes_ha/
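For reference, a minimal sketch of the settings that guide covers, as they would appear in the Flink configuration (the storage path is a placeholder; when using the operator, `kubernetes.cluster-id` is normally managed for you):

```yaml
# Kubernetes HA, Flink 1.17-style keys; storageDir must point to durable shared storage.
kubernetes.cluster-id: my-cluster-id              # placeholder; operator usually sets this
high-availability.type: kubernetes
high-availability.storageDir: s3://flink/recovery
```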
Nicholas:
Hi @Flaviu Cicio, yes, we have `high-availability.type: kubernetes` and `high-availability.storageDir: s3://flink/recovery` configured. The issue is that this directory in S3 is cleared out every time we do a `suspend`, so we haven't been able to find a way of stopping the job while leaving the contents of the high-availability directory in place. Hope that makes sense.
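Purely as a hedged sketch, and not something the thread itself suggests: retained (externalized) checkpoints survive job cancellation or suspension independently of the HA directory, and Flink can resume from a retained checkpoint path in the same way as from a savepoint. The paths below are placeholders.

```yaml
# Hypothetical additions to flinkConfiguration: write checkpoints to durable storage
# and retain the latest one when the job is cancelled or suspended.
state.checkpoints.dir: s3://flink/checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```

The retained checkpoint directory (e.g. `s3://flink/checkpoints/<job-id>/chk-<n>`) could then be supplied as `initialSavepointPath` on the FlinkDeployment in the new cluster, assuming the state backend and job topology are compatible.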