Apache Flink

When running a Flink job with checkpoints, sometimes a node just refuses to send the acknowledgement and the checkpoint stalls for a long time. Does anyone know any possible causes?

Have you checked resource usage (memory availability, network, etc.)? Are you using incremental checkpointing, and what is the checkpoint interval?

1. Check the network, firewalls, etc. Network issues could easily cause this. Is other traffic getting through?

2. Check for resource constraints: CPU, memory, disk I/O.

3. Check the health of the storage medium used for checkpointing.

All of this information comes from various sources: logging, Prometheus, Flink UI metrics, etc., so you should familiarize yourself with them as much as possible.
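To narrow down which operator is holding up the acknowledgement, one option is to pull the checkpoint details from the JobManager REST API (the same data the Flink UI shows) and list the tasks that have not fully acknowledged. The following is a rough sketch, not a definitive tool: the helper names are made up for illustration, and the field names `tasks`, `num_subtasks`, and `num_acknowledged_subtasks` are what I would expect in the response of `/jobs/<jobid>/checkpoints/details/<checkpointid>`, so verify them against the REST API docs for your Flink version.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_details(base_url, job_id, checkpoint_id):
    """Fetch per-task statistics for one checkpoint from the JobManager REST API.

    base_url is something like "http://jobmanager:8081" (hypothetical host).
    """
    url = f"{base_url}/jobs/{job_id}/checkpoints/details/{checkpoint_id}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def find_unacknowledged_tasks(details):
    """Return (vertex_id, acked, total) for each task still missing acks.

    Assumes every entry under "tasks" carries "num_subtasks" and
    "num_acknowledged_subtasks"; missing fields are treated as 0.
    """
    lagging = []
    for vertex_id, task in details.get("tasks", {}).items():
        total = task.get("num_subtasks", 0)
        acked = task.get("num_acknowledged_subtasks", 0)
        if acked < total:
            lagging.append((vertex_id, acked, total))
    return lagging
```

Calling `find_unacknowledged_tasks(fetch_checkpoint_details(base_url, job_id, checkpoint_id))` while a checkpoint is stalled should point to the job vertex whose subtasks are behind. From there you can look at the TaskManager hosting those subtasks and check its logs, network, and resource usage as suggested above.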