# troubleshooting
d
Hello, I'm having trouble with checkpoints. A checkpoint usually takes less than 15s, but after some time checkpoints start failing because of the timeout (30 min). As far as I can see, the checkpoint can't complete in one subtask of a KeyedCoProcessFunction with parallelism 32; for the other subtasks it finishes in the normal time (less than 15s). What might be the cause? I tried enabling unaligned checkpoints, but it didn't help.
h
Hi, it would be useful to use the Flink dashboard to debug this. Try looking at the flame graphs to understand where the time is being spent. Alternatively, you can take thread dumps of the TMs.
d
@Demid P It’s hard to know without more information, but the behavior you’re describing can be caused by high skew. You’re using a keyed function, so if one key suddenly gets a very large amount of data, a single subtask of the 32 gets stuck processing that high-volume key. This affects checkpointing because Flink injects a special checkpoint barrier event that flows downstream through the operators. If one of the 32 subtasks has a very large backlog of pending data, it has to get through all of it before it receives the checkpoint barrier. That delay adds to the overall time the checkpoint takes.
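To make the skew scenario concrete, here is a minimal Java sketch that simulates how records spread over 32 subtasks when one key is much hotter than the rest. It is an illustration only: `subtaskFor` is a hypothetical stand-in that uses a plain `hashCode` modulo, whereas Flink actually hashes keys into key groups before assigning them to subtasks. The key names and record counts are made up.

```java
import java.util.Arrays;

public class SkewSketch {
    // Simplified, hypothetical stand-in for key-to-subtask routing.
    // Flink's real assignment goes through key groups and a different hash.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Sum the record counts of all keys that land on each subtask.
    static long[] recordsPerSubtask(String[] keys, long[] counts, int parallelism) {
        long[] perSubtask = new long[parallelism];
        for (int i = 0; i < keys.length; i++) {
            perSubtask[subtaskFor(keys[i], parallelism)] += counts[i];
        }
        return perSubtask;
    }

    // Ratio of the busiest subtask to the median subtask.
    static double skewRatio(long[] perSubtask) {
        long[] sorted = perSubtask.clone();
        Arrays.sort(sorted);
        long max = sorted[sorted.length - 1];
        long median = sorted[sorted.length / 2];
        return median == 0 ? Double.POSITIVE_INFINITY : (double) max / median;
    }

    public static void main(String[] args) {
        int parallelism = 32;
        int keyCount = 1000;
        String[] keys = new String[keyCount];
        long[] counts = new long[keyCount];
        for (int i = 0; i < keyCount; i++) {
            keys[i] = "key-" + i;
            counts[i] = 100; // ordinary keys carry a small, equal load
        }
        counts[0] = 1_000_000; // one hot key (made-up volume)

        long[] perSubtask = recordsPerSubtask(keys, counts, parallelism);
        System.out.printf("busiest-to-median ratio: %.1f%n", skewRatio(perSubtask));
    }
}
```

Whichever subtask receives the hot key ends up with a backlog far larger than its peers, which is the situation where a barrier can sit behind pending data on just that one subtask.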
@Demid P If you’re able to view the Flink UI around the time the checkpoints get slow, the subtasks for that operator should show large differences in the number of records received. If you don’t see much skew, that rules it out. If you do see it, skew could be the problem (though it isn’t definitive unless the skew is extremely large, think >10:1). If you do see significant skew, a few options to consider:
• If you don’t need exactly-once semantics, you could switch your checkpointing mode to `AT_LEAST_ONCE`. According to the docs, “checkpoints will not block any channels with barriers during alignment” in this mode, unlike the default `EXACTLY_ONCE` checkpointing setting. This won’t speed up checkpointing, but it could help keep the application from being blocked for too long, if that matters for your use case.
• Tackle the skew. There’s a great Flink Forward 2022 talk on this: https://www.slideshare.net/FlinkForward/evening-out-the-uneven-dealing-with-skew-in-flink-252485368
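As one possible way to tackle skew, here is a minimal Java sketch of key salting with a two-stage count. This is an assumption about a commonly used technique, not necessarily what the linked talk recommends, and it only applies when the per-key logic can be split into partial results that are merged later. A `KeyedCoProcessFunction` that needs all state for a key in one place may not fit this pattern. The helper names and the `#` separator are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SaltedKeySketch {
    // Append a salt in [0, buckets) so one hot key becomes several distinct keys.
    static String saltedKey(String key, long sequence, int buckets) {
        return key + "#" + Math.floorMod(sequence, buckets);
    }

    // Recover the original key by dropping the salt suffix.
    static String originalKey(String salted) {
        return salted.substring(0, salted.lastIndexOf('#'));
    }

    // Stage 1: count per salted key (in a real job this work is spread over subtasks).
    static Map<String, Long> partialCounts(String[] keys, int buckets) {
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            partial.merge(saltedKey(keys[i], i, buckets), 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: merge the partial counts back onto the original key.
    static Map<String, Long> mergeCounts(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            total.merge(originalKey(entry.getKey()), entry.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] keys = {"hot", "hot", "hot", "hot", "cold"};
        Map<String, Long> partial = partialCounts(keys, 2);
        System.out.println("partial: " + partial);
        System.out.println("merged:  " + mergeCounts(partial));
    }
}
```

The second stage still sees every original key, but it only receives one pre-aggregated record per salted key rather than the raw volume, so the hot key no longer concentrates all of its records on a single subtask.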
d
Lots of timers firing at the same time? This is currently the biggest limitation that could block checkpoints that I’m aware of.