# troubleshooting
g
Hi all, we have multiple Flink session clusters, and I was wondering why a Flink job would lose its state when the job gets redeployed? i.e., say there is a job named "flink-job"; if I cancel this job and re-submit it, the new job seems to start from scratch. Shouldn't it instead start from its previous checkpoint? I don't see any checkpoints being restored. Can someone help me understand this?
d
When a Flink job is cancelled and then re-submitted, it will indeed start from scratch by default, unless it is explicitly configured to restore from a previous savepoint or checkpoint. This is because checkpoints and savepoints are both fault-tolerance mechanisms in Flink, but they play different roles during job redeployment. Checkpoints are periodic and automatic, designed primarily for recovery from failures within the lifecycle of a running job. When you cancel a job and restart it, Flink does not automatically restore from the last checkpoint by default.
Savepoints, on the other hand, are user-triggered and are typically used for planned maintenance, version upgrades, or changing the job's parallelism. If you want a job to resume from where it left off after cancellation and redeployment, you should manually trigger a savepoint before cancelling the job, and then specify that savepoint when submitting the job again.
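A rough sketch of that workflow with the Flink CLI (the job ID, savepoint directory, and jar path below are placeholders; the exact savepoint path is whatever the savepoint command prints):

```sh
# Trigger a savepoint for the running job; the CLI prints the savepoint path on completion.
flink savepoint <job-id> s3://my-bucket/flink/savepoints

# Cancel the job once the savepoint has completed.
flink cancel <job-id>

# Re-submit the job, pointing it at the savepoint so it restores its state.
flink run -s s3://my-bucket/flink/savepoints/savepoint-xxxxxx ./my-flink-job.jar
```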
g
> When you cancel a job and restart it, Flink does not automatically restore from the last checkpoint by default.
Thanks Draco, what do I need to do to have the job restore from a checkpoint on resubmit?
I don't want the job to lose its state. Also, is it possible to enable savepoints on a session cluster job? I thought savepoints were only for application clusters.
d
Concerning savepoints on a session cluster, yes, it is absolutely possible and common practice to use savepoints with both session and per-job clusters in Flink. Savepoints are a versatile feature designed for maintaining the state of applications across upgrades, configuration changes, or even switching between session and per-job clusters. They are not limited to a specific type of cluster deployment.
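For a session cluster you just point the CLI at that cluster's JobManager when triggering the savepoint; a quick sketch (host, port, job ID, and bucket path are placeholders):

```sh
# Trigger a savepoint for a job running on a session cluster by addressing its JobManager REST endpoint.
flink savepoint -m jobmanager-host:8081 <job-id> s3://my-bucket/flink/savepoints
```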
g
OK, could you please point me to some documentation on how I can use checkpoints and savepoints to restore state on job resubmit?
d
To clarify, if you cancel a Flink job and then want to resume from where it left off, you must use savepoints, not regular checkpoints, due to the nature of their intended use cases—checkpoints for automatic recovery within a continuous job run, and savepoints for planned maintenance or state migrations.
If you have additional questions about configuration, please post those as separate questions in #C03G7LJTS2G
g
But restoring from savepoints causes a lot of duplicate data to be reprocessed, right? As in, we usually take a savepoint every hour or so, so by restoring from a savepoint it'll be as though we went back an hour to reprocess data. We take checkpoints every 30s; it would have been much better if we could restart from a checkpoint.
d
How are you cancelling?
If you are not cancelling frequently, the overhead is basically just one savepoint prior to that operation.
I think the policy is: if you cancel a job as part of operations, as opposed to some failure, then you want to take the savepoint first, followed by the cancellation operation.
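On reasonably recent Flink versions the CLI can combine those two steps; a sketch (directory and job ID are placeholders):

```sh
# Take a savepoint and then gracefully stop the job in one operation.
flink stop --savepointPath s3://my-bucket/flink/savepoints <job-id>

# Older style: cancel with a savepoint (deprecated in favour of stop in newer versions).
flink cancel -s s3://my-bucket/flink/savepoints <job-id>
```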
g
Using flink cancel API
Makes sense.
d
I think frequent checkpointing is a good idea for actual failures, errors, unplanned interruptions, etc. But for planned cancellations, savepoints seem to be what's recommended by the documentation.
g
Thank you!
d
Good luck on it!
g
One last question: if a job restarts by itself, say due to some EKS pod shuffling that caused a task manager to be taken out and reallocated elsewhere, I'm assuming the job would restart and restore from a previous checkpoint, right?
d
Yes, you are correct. If a Flink job is set up with checkpointing enabled and it experiences a failure or restart due to events like an Elastic Kubernetes Service (EKS) pod rescheduling, Flink is designed to automatically restore from the most recent successful checkpoint. This is a core aspect of Flink’s fault-tolerance mechanism.
It’s important to ensure that your checkpointing interval is set such that you’re not losing too much progress in case of a restart. Additionally, having reliable storage for checkpoints (like HDFS, S3, or another distributed file system) is crucial to make sure the checkpoints are accessible when needed for recovery.
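As a sketch of that kind of setup (the bucket path and jar are placeholders; depending on your Flink version these keys can be passed as -D options at submission time or set in flink-conf.yaml):

```sh
# Checkpoint every 30 seconds and keep checkpoints on durable shared storage (S3 here),
# so they remain readable after a TaskManager pod is rescheduled.
flink run \
  -D execution.checkpointing.interval=30s \
  -D state.checkpoints.dir=s3://my-bucket/flink/checkpoints \
  ./my-flink-job.jar
```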
g
great, thank you