# troubleshooting
Nicholas:
We want to do the following and simply can't find any standard process for doing so:
• Migrate to a new Kubernetes cluster
• Use the last checkpoint (Kafka source) to continue from when creating the new job, with no duplicates

This seems very straightforward, and it also appears to be provisioned for in the documentation (connectors/datastream/kafka/#starting-offset). However, the following strategies don't work:
• Suspending the job removes the contents of the `high-availability` directory (which is where the reference to the last checkpoint is stored)
• `kubectl delete flinkdeployment` also deletes the contents of the `high-availability` directory
• `kubectl delete deployment` leads to the Flink operator recreating the job/task pods, which means they run in both Kubernetes clusters and produce duplicates

How is one supposed to do this? It seems like an extremely common requirement. There is a very laborious workaround: suspend with a savepoint, copy the last savepoint, and redeploy the new job with a reference to that savepoint (see the sketch after this message). This surely can't be the standard way of doing it. Any help/advice would be appreciated.
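For reference, a minimal sketch of the laborious savepoint-based path described above, using the Flink Kubernetes Operator's FlinkDeployment resource. The resource name, image, jar path, and bucket paths are placeholders, and the exact spec may need adjusting to your operator version.

```yaml
# Step 1 (old cluster): suspend the job; with upgradeMode: savepoint the operator
# takes a savepoint before stopping it, e.g.:
#   kubectl patch flinkdeployment my-job --type merge \
#     -p '{"spec":{"job":{"state":"suspended","upgradeMode":"savepoint"}}}'
#
# Step 2: note the resulting savepoint path (and copy it if the new cluster
# uses different storage).
#
# Step 3 (new cluster): create the deployment pointing at that savepoint.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job                                    # placeholder
spec:
  image: my-registry/my-flink-job:latest          # placeholder
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: s3://flink/recovery
    state.savepoints.dir: s3://flink/savepoints   # assumed savepoint location
  jobManager:
    resource: { memory: "2048m", cpu: 1 }
  taskManager:
    resource: { memory: "2048m", cpu: 1 }
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar  # placeholder
    parallelism: 2
    upgradeMode: savepoint
    state: running
    # Restore from the copied savepoint so the Kafka source resumes from the
    # offsets stored in it rather than from the configured starting offset.
    initialSavepointPath: s3://flink/savepoints/savepoint-xxxx   # placeholder path
```

With `upgradeMode: savepoint`, later spec changes in the new cluster should again go through savepoints rather than relying on the HA metadata.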
Flaviu Cicio:
Hello Nicholas, sounds like a configuration issue. Have you followed the Kubernetes HA guide? https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/ha/kubernetes_ha/
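For reference, a minimal sketch of the settings that guide covers, as they would appear in the Flink configuration (the storage path is a placeholder; when using the operator, `kubernetes.cluster-id` is normally managed for you):

```yaml
# Kubernetes HA, Flink 1.17-style keys; storageDir must point to durable shared storage.
kubernetes.cluster-id: my-cluster-id              # placeholder; operator usually sets this
high-availability.type: kubernetes
high-availability.storageDir: s3://flink/recovery
```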
Nicholas:
Hi @Flaviu Cicio, yes, we have `high-availability.type: kubernetes` and `high-availability.storageDir: s3://flink/recovery` configured. The issue is that this directory in S3 is cleared out every time we do a `suspend`, so we haven't been able to find a way of stopping the job while leaving the contents of the high-availability directory in place. Hope that makes sense.
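Purely as a hedged sketch, and not something the thread itself suggests: retained (externalized) checkpoints survive job cancellation or suspension independently of the HA directory, and Flink can resume from a retained checkpoint path in the same way as from a savepoint. The paths below are placeholders.

```yaml
# Hypothetical additions to flinkConfiguration: write checkpoints to durable storage
# and retain the latest one when the job is cancelled or suspended.
state.checkpoints.dir: s3://flink/checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```

The retained checkpoint directory (e.g. `s3://flink/checkpoints/<job-id>/chk-<n>`) could then be supplied as `initialSavepointPath` on the FlinkDeployment in the new cluster, assuming the state backend and job topology are compatible.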