Hi all when I set the taskManager cpu number to more than 1 Apache Flink #troubleshooting

Bjarke Tornager

08/02/2024, 11:38 AM

Hi all, when I set the taskManager cpu number to more than 1, in a stateful application that uses keyed coprocessing, I get state restore failures when the job restarts to adjust the parallelism. However, when setting the taskManager cpu to 1 I do not get this issue. Any configuration I can do to avoid this problem?

D. Draco O'Brien

08/03/2024, 5:29 AM

Well, you should make sure that you are using a state backend that supports incremental checkpoints, like RocksDB.

D. Draco O'Brien

08/03/2024, 5:32 AM

Restoring from a checkpoint needs frequent and reliable checkpoints/savepoints.
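A minimal sketch of what RocksDB with incremental checkpoints can look like in the Flink configuration. The key names are the Flink 1.x ones (newer releases rename some of them), and the interval is only a placeholder assumption, not a recommendation.

```yaml
# Sketch only: Flink 1.x key names; check the configuration docs for your version.
state.backend: rocksdb                  # RocksDB keyed state backend
state.backend.incremental: true         # each checkpoint uploads only new/changed RocksDB files
execution.checkpointing.interval: 60s   # placeholder value (assumption)
```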

D. Draco O'Brien

08/03/2024, 5:35 AM

You can also set a restart strategy explicitly, for example fixed-delay, instead of falling back to the cluster default; that can give you more control over job restarts.
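A minimal sketch of a fixed-delay restart strategy in the Flink configuration, assuming a recent Flink 1.x release. The attempt count and delay are placeholder values, not tuned recommendations.

```yaml
# Sketch only: values are placeholders (assumptions).
restart-strategy.type: fixed-delay         # older releases use the key `restart-strategy`
restart-strategy.fixed-delay.attempts: 5   # retries before the job is declared failed
restart-strategy.fixed-delay.delay: 30 s   # wait between two restart attempts
```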

D. Draco O'Brien

08/03/2024, 5:37 AM

You might try setting the number of key groups (the job's max parallelism) based on your parallelism; this affects how keyed state is distributed across tasks.
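In Flink the number of key groups is the job's max parallelism, so one way to pin it is through configuration. A minimal sketch; both numbers are illustrative assumptions.

```yaml
# Sketch only: numbers are illustrative assumptions.
parallelism.default: 4
pipeline.max-parallelism: 128   # number of key groups; should be >= the highest parallelism you expect
```

Changing the max parallelism of a job that already has keyed state generally means its existing checkpoints/savepoints can no longer be restored, so it is best fixed before state builds up.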

D. Draco O'Brien

08/03/2024, 5:40 AM

Another setting you can look at is CPU core allocation, i.e. if your machine has 8 cores and taskmanager.numberOfTaskSlots is set to 4, you want to ensure that each slot can use approximately 2 cores. taskmanager.cpu.cores is available on some versions of Flink.
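A minimal sketch of the 8-core example above as Flink configuration. Whether taskmanager.cpu.cores is honoured depends on the Flink version and deployment mode, so treat that line as an assumption.

```yaml
# Sketch only: 4 slots sharing 8 cores, i.e. roughly 2 cores per slot.
taskmanager.numberOfTaskSlots: 4
taskmanager.cpu.cores: 8.0   # assumption: your version/deployment mode reads this option
```

Note that slots do not isolate CPU from each other; the ratio is a sizing guideline rather than a hard limit.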

D. Draco O'Brien

08/03/2024, 5:42 AM

In addition to these configs, make sure you configure enough memory, and set the log level to DEBUG to try to see the causes of the state restore failures.

D. Draco O'Brien

08/03/2024, 5:43 AM

Let us know if any of these configurations resolve the issue.

Bjarke Tornager

08/04/2024, 11:52 AM

@D. Draco O'Brien thanks, I will try your suggestions. I am using RocksDB as the backend (state.backend: rocksdb). But it seems like the state size might be affecting this. What I have noticed is that with a small state size of around 10 GB I don't seem to have issues, but as state grows to around 60 GB I keep getting issues. I am using SSDs, so it is a bit puzzling to me. For the local state I use the state.backend.rocksdb.localdir: /mnt/flink config, but of course my checkpoints are using durable storage: ADLS Gen2.
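A sketch of the setup described in this message as Flink configuration. The checkpoint URI is a placeholder (the real container, account and path are not given in the thread), and state.checkpoints.dir is the Flink 1.x key name.

```yaml
# Sketch only: the abfss:// URI is a placeholder (assumption).
state.backend: rocksdb
state.backend.rocksdb.localdir: /mnt/flink   # local RocksDB working directory on SSD
state.checkpoints.dir: abfss://<container>@<account>.dfs.core.windows.net/<path>   # durable checkpoints on ADLS Gen2
```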

D. Draco O'Brien

08/04/2024, 1:28 PM

If you are on K8s, StatefulSets might be more stable: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

Bjarke Tornager

08/05/2024, 7:23 AM

I am using the Apache Flink operator for the deployment and operation of my job, so I am not sure that would help me much. Can you explain a bit more about how it might help me?

D. Draco O'Brien

08/05/2024, 8:29 AM

I am not sure if Operator can use StatefulSets. You can take a look at this thread https://www.mail-archive.com/user@flink.apache.org/msg51675.html

D. Draco O'Brien

08/05/2024, 8:30 AM

You can read about the advantages here: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

D. Draco O'Brien

08/05/2024, 8:38 AM

I think the FlinkSessionJobs or FlinkDeployments within Flink Operator can internally create and manage StatefulSets for JobManager and TaskManager pods. This might give the pods in the StatefulSet a unique identity, and more stable storage.
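For context on the FlinkDeployment resource mentioned above, a minimal fragment showing where the taskManager CPU from the original question and Flink settings such as the slot count would go. The name, sizes and values are hypothetical, and required fields such as the image, Flink version and job spec are left out.

```yaml
# Fragment only: name and values are hypothetical (assumptions).
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                       # hypothetical name
spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"    # values in this map are strings
  taskManager:
    resource:
      cpu: 2                              # the "taskManager cpu number" from the question
      memory: "4096m"
```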
