# troubleshooting
a
👋 Hello community. Perhaps this rings a bell for somebody. We have a Flink deployment in Application mode with ~600 TMs. During an upgrade via the Apache Flink Kubernetes Operator, the operator tries to take a savepoint. From time to time it fails on the check and raises an exception:
org.apache.flink.runtime.checkpoint.FinishedTaskStateProvider$PartialFinishingNotSupportedByStateException: The vertex <_VERTEX_NAME_> (id = 09d76bde52b3fbb1988a52ca0243c5b0) has used UnionListState, but part of its tasks has called operators' finish method.
Does anybody recall hitting a similar issue? Thanks
d
The error you’re encountering, FinishedTaskStateProvider$PartialFinishingNotSupportedByStateException, typically indicates that something happened during checkpointing where a state backend operation (specifically, the use of a UnionListState) is incompatible with the current status of the tasks in your Flink job.
This usually happens when a task has already started shutting down, typically triggered by a finish() call on its operators, while there is an attempt to take a savepoint that includes a UnionListState. Flink’s UnionListState is designed to aggregate state across the parallel instances of an operator, and it requires all parallel instances to be active to function correctly during a savepoint. If any parallel instance has finished execution (or is in the process of finishing), the savepoint cannot fully represent the unified state, and you get the exception above.
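To make the pattern concrete, here is a minimal sketch of how UnionListState typically ends up in a job. This is illustrative only (the class name, state name, and offset logic are made up); the call to look for is getUnionListState() inside a CheckpointedFunction:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Illustrative only: a source that tracks offsets in union list state.
public class OffsetTrackingSource extends RichParallelSourceFunction<String>
        implements CheckpointedFunction {

    private transient ListState<Long> unionOffsets;          // snapshot handle
    private final List<Long> localOffsets = new ArrayList<>();
    private volatile boolean running = true;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Long> descriptor =
                new ListStateDescriptor<>("offsets", Long.class);
        // getUnionListState (as opposed to getListState) is the call that makes
        // every subtask receive the full state on restore, and it is what the
        // savepoint check complains about when some subtasks have already finished.
        unionOffsets = context.getOperatorStateStore().getUnionListState(descriptor);
        if (context.isRestored()) {
            for (Long offset : unionOffsets.get()) {
                localOffsets.add(offset);
            }
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        unionOffsets.update(localOffsets);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // emit records and append to localOffsets ...
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

Operators that need every subtask to see the complete state after a restore (for example, to redistribute offsets) are the usual place this pattern shows up.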
So what to do about it …
Well, the first thing is to look into task shutdown behavior and make sure that tasks are only finishing as planned and not due to any unexpected errors. For this you of course want to look at the Flink logs, with the right logging levels set, focusing on the tasks and any unusual errors or task shutdown behavior.
You might also take a look at your savepoint vs. checkpoint triggering strategy. If upgrades frequently cause this issue, you might consider taking a manual savepoint before starting the upgrade process to ensure all tasks are in a steady state.
I don’t know whether what you are encountering is a known Flink issue, but you might check the Flink issue tracker as well.
You might also need to adjust checkpointing parameters to accommodate your workload and cluster size.
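To illustrate what I mean by checkpointing parameters, here is a sketch assuming programmatic configuration; the values are examples only, not recommendations, and the same settings can also be set in the Flink configuration:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Example values only; tune for your own workload and cluster size.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointTimeout(10 * 60_000);        // give large state time to complete
        config.setMinPauseBetweenCheckpoints(30_000);    // avoid back-to-back checkpoints
        config.setMaxConcurrentCheckpoints(1);           // one checkpoint/savepoint at a time
        config.setTolerableCheckpointFailureNumber(3);   // don't fail the job on one bad checkpoint
    }
}
```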
There is one more thing you need to check, and that’s operator coordination and ensuring graceful shutdown. There are several aspects to this. You want to inspect the code for usage of UnionListState (see the sketch above for what that typically looks like). If it is not used explicitly, it might be happening implicitly through certain window or aggregation functions. Keep in mind that UnionListState is typically used when the state of multiple parallel operator instances needs to be combined into a single view, so whether it’s implicit or explicit, this is likely where the usage is occurring.
To check for graceful shutdown, inspect your operator implementations’ lifecycle callbacks: the open(), close(), and cancel() methods. With cancel(), make sure it gives the operator a chance to gracefully clean up resources and persist state as needed; a minimal sketch of that pattern follows.
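This is a sketch of the shutdown shape to look for, assuming a simple source (the class name and file path are hypothetical): cancel() only signals run() to stop, and the actual cleanup happens in close().

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Illustrative only: class name and file path are hypothetical.
public class LineReaderSource extends RichSourceFunction<String> {

    private volatile boolean running = true;
    private transient BufferedReader reader;

    @Override
    public void open(Configuration parameters) throws Exception {
        reader = new BufferedReader(new FileReader("/tmp/input.txt"));  // hypothetical path
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        String line;
        while (running && (line = reader.readLine()) != null) {
            // Emit under the checkpoint lock so records and state stay consistent.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(line);
            }
        }
    }

    @Override
    public void cancel() {
        // Only signal run() to stop; do the actual cleanup in close().
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (reader != null) {
            reader.close();
        }
    }
}
```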
Also check for proper exception handling, as failure to handle exceptions properly can lead to abrupt terminations instead of graceful shutdowns.
Your check of the logs should cover not just errors but also warnings related to task shutdowns, to see if there are any patterns that might be related to the error.
Lastly, the savepoint/checkpoint data itself might give you more information. You can trigger a savepoint manually with the `flink savepoint <jobId> [<targetDirectory>]` CLI command and then inspect the resulting savepoint data (Flink’s State Processor API is the usual way to read it programmatically).
The savepoint contains all operators in the job graph and their states; look for operators using UnionListState. The metadata should also contain the parallelism of each operator, which could be a factor in the error, and state sizes, which can pinpoint areas that might need optimization. In addition to the CLI, you might find visual tools that help with inspecting savepoints/checkpoints.
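If it helps, here is a rough sketch of reading union state back out of a savepoint with the State Processor API. This assumes a Flink 1.16-style SavepointReader; the exact classes and method signatures have changed across Flink versions, and the savepoint path, operator uid, and state name below are placeholders:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.SavepointReader;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class InspectSavepoint {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path; use the state backend the job was running with.
        SavepointReader savepoint = SavepointReader.read(
                env, "s3://bucket/savepoints/savepoint-xxxx", new HashMapStateBackend());

        // "my-operator-uid" and "offsets" stand in for the operator uid and
        // the union state name you found in your job code.
        DataStream<Long> unionState =
                savepoint.readUnionState("my-operator-uid", "offsets", Types.LONG);

        unionState.print();
        env.execute("inspect-savepoint");
    }
}
```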
Let us know if this resolves it, or if you can share additional information about your environment and configuration.
a
Thanks Draco for a lot of great workarounds; you really covered so many different cases, thanks a lot. The issue happens during the `stop-with-savepoint` call from the k8s operator, so in the end we will be using checkpoints to perform releases. Meanwhile, it really feels like something strange is going on with the `stop-with-savepoint` operation, since it allows operators to finish before the savepoint is performed. Since our specific deployment has 600 TMs, a parallelism of 2000, and the JM at 100% CPU, I really suspect some thread/process race issue. We will increase the amount of CPU for the JM and see whether we hit the same issue again.