# troubleshooting
g
Hi all, we have multiple Flink session clusters, and I was wondering why a Flink job would lose its state when the job gets redeployed? i.e., say there is a job named "flink-job"; if I cancel this job and re-submit it, the new job seems to start from scratch. Shouldn't it instead start from its previous checkpoint? I don't see any checkpoints being restored. Can someone help me understand this?
d
When a Flink job is cancelled and then re-submitted, it will indeed start from scratch by default, unless it is explicitly configured to restore from a previous savepoint or checkpoint. This is because checkpoints and savepoints are both fault-tolerance mechanisms in Flink, but they play different roles during job redeployment. Checkpoints are periodic and automatic, designed primarily for recovery from failures within the lifecycle of a running job. When you cancel a job and restart it, Flink does not automatically restore from the last checkpoint by default.
Savepoints, on the other hand, are user-triggered and are typically used for planned maintenance, version upgrades, or changing the job's parallelism. If you want a job to resume from where it left off after cancellation and redeployment, you should manually trigger a savepoint before cancelling the job, and then specify that savepoint when submitting the job again.
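A rough sketch of that workflow with the Flink CLI (the job ID, savepoint directory, and jar path below are placeholders; the exact savepoint path is whatever the savepoint command prints):

```sh
# Trigger a savepoint for the running job; the CLI prints the savepoint path on completion.
flink savepoint <job-id> s3://my-bucket/flink/savepoints

# Cancel the job once the savepoint has completed.
flink cancel <job-id>

# Re-submit the job, pointing it at the savepoint so it restores its state.
flink run -s s3://my-bucket/flink/savepoints/savepoint-xxxxxx ./my-flink-job.jar
```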
g
> When you cancel a job and restart it, Flink does not automatically restore from the last checkpoint by default.
Thanks Draco, what do I need to do to have the job restore from a checkpoint on resubmit?
I don't want the job to lose its state. Also, is it possible to enable savepoints on a session cluster job? I thought savepoints were only for application clusters.
d
Concerning savepoints on a session cluster, yes, it is absolutely possible and common practice to use savepoints with both session and per-job clusters in Flink. Savepoints are a versatile feature designed for maintaining the state of applications across upgrades, configuration changes, or even switching between session and per-job clusters. They are not limited to a specific type of cluster deployment.
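For a session cluster you just point the CLI at that cluster's JobManager when triggering the savepoint; a quick sketch (host, port, job ID, and bucket path are placeholders):

```sh
# Trigger a savepoint for a job running on a session cluster by addressing its JobManager REST endpoint.
flink savepoint -m jobmanager-host:8081 <job-id> s3://my-bucket/flink/savepoints
```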
g
OK, could you please point me to some documentation on how I can use checkpoints and savepoints to restore state on job resubmit?
d
To clarify, if you cancel a Flink job and then want to resume from where it left off, you must use savepoints, not regular checkpoints, due to the nature of their intended use cases—checkpoints for automatic recovery within a continuous job run, and savepoints for planned maintenance or state migrations.
If you have additional questions about configuration, please post those as separate questions in #C03G7LJTS2G
g
But restoring from savepoints causes a lot of duplicate data to be reprocessed, right? As in, we usually take a savepoint every hour or so, so by restoring from a savepoint it'll be as though we went back an hour to reprocess data. We take checkpoints every 30s; it would have been much better if we could restart from a checkpoint.
d
How are you cancelling?
If you are not cancelling frequently, the overhead is basically just one savepoint prior to that operation.
I think the policy is: if you cancel a job as part of operations, as opposed to some failure, then you want to take the savepoint first, followed by the cancellation operation.
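On reasonably recent Flink versions the CLI can combine those two steps; a sketch (directory and job ID are placeholders):

```sh
# Take a savepoint and then gracefully stop the job in one operation.
flink stop --savepointPath s3://my-bucket/flink/savepoints <job-id>

# Older style: cancel with a savepoint (deprecated in favour of stop in newer versions).
flink cancel -s s3://my-bucket/flink/savepoints <job-id>
```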
g
Using flink cancel API
Makes sense.
d
I think frequent checkpointing is a good idea for actual failures, errors, unplanned interruptions, etc. But for planned cancellations, savepoints seem to be what's recommended by the documentation.
g
Thank you!
d
Good luck on it!
g
One last question: if a job restarts by itself, say due to some EKS pod shuffling that caused a task manager to be taken out and reallocated elsewhere, I'm assuming the job would restart and restore from a previous checkpoint, right?
d
Yes, you are correct. If a Flink job is set up with checkpointing enabled and it experiences a failure or restart due to events like an Elastic Kubernetes Service (EKS) pod rescheduling, Flink is designed to automatically restore from the most recent successful checkpoint. This is a core aspect of Flink’s fault-tolerance mechanism.
It’s important to ensure that your checkpointing interval is set such that you’re not losing too much progress in case of a restart. Additionally, having reliable storage for checkpoints (like HDFS, S3, or another distributed file system) is crucial to make sure the checkpoints are accessible when needed for recovery.
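As a sketch of that kind of setup (the bucket path and jar are placeholders; depending on your Flink version these keys can be passed as -D options at submission time or set in flink-conf.yaml):

```sh
# Checkpoint every 30 seconds and keep checkpoints on durable shared storage (S3 here),
# so they remain readable after a TaskManager pod is rescheduled.
flink run \
  -D execution.checkpointing.interval=30s \
  -D state.checkpoints.dir=s3://my-bucket/flink/checkpoints \
  ./my-flink-job.jar
```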
g
great, thank you