# troubleshooting
b
Hi all, when I set the TaskManager CPU count to more than 1 in a stateful application that uses keyed co-processing, I get state restore failures when the job restarts to adjust the parallelism. With the TaskManager CPU set to 1, I do not get this issue. Is there any configuration I can set to avoid this problem?
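For readers following along, here is a sketch of where that CPU setting typically lives, assuming the job is deployed as a FlinkDeployment through the Flink Kubernetes Operator (mentioned further down the thread). The name, image, jar path, memory sizes, slot count, and parallelism are placeholders, not values from the actual job.

```yaml
# Hypothetical FlinkDeployment; every concrete value here is a placeholder.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: keyed-coprocess-job          # placeholder name
spec:
  image: flink:1.17                  # placeholder image and version
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "8192m"
      cpu: 2                         # the "TaskManager CPU" value; failures are reported when this is > 1
  job:
    jarURI: local:///opt/flink/usrlib/job.jar   # placeholder path
    parallelism: 4
```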
d
Well, you should make sure that you are using a state backend that supports incremental checkpoints, such as RocksDB.
Restoring state relies on checkpoints/savepoints being taken frequently and completing reliably.
You can configure a restart strategy (restart-strategy in the config, or RestartStrategies.fixedDelayRestart / fallBackRestart in code), which can give you more control over job restarts.
You might also try setting the number of key groups (the max parallelism, i.e. pipeline.max-parallelism or setMaxParallelism) based on your parallelism; this affects how keyed state is redistributed across tasks when the parallelism changes.
Another setting you can look at is CPU core allocation, i.e. if your machine has 8 cores and taskmanager.numberOfTaskSlots is set to 4, you want to ensure that each slot can use approximately 2 cores. taskmanager.cpu.cores is available on some versions of Flink.
In addition to these configs, you want to configure enough memory and set the log level to DEBUG to try to see what is causing the state restoration failures.
Let us know if any of these configurations resolve the issue.
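As a rough sketch, the suggestions above might translate into flink-conf.yaml entries like the following. All values are illustrative assumptions rather than recommendations, and exact key names can differ between Flink versions, so check them against the docs for the version in use.

```yaml
# flink-conf.yaml sketch; values are illustrative assumptions.

# Incremental checkpoints (only takes effect with state.backend: rocksdb).
state.backend.incremental: true
execution.checkpointing.interval: 60s

# Fixed-delay restart strategy: retry up to 10 times, 30 s apart.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30s

# Max parallelism is the number of key groups. It generally has to stay the
# same when restoring from an existing checkpoint/savepoint, so pick it once.
pipeline.max-parallelism: 128

# 4 slots on 8 cores is roughly 2 cores per slot.
taskmanager.numberOfTaskSlots: 4
taskmanager.cpu.cores: 8

# Placeholder size; leave headroom for RocksDB's native memory.
taskmanager.memory.process.size: 16g
```

With the Flink Kubernetes Operator, these keys would normally go under spec.flinkConfiguration, and as far as I know the CPU is usually set through the TaskManager resource spec rather than taskmanager.cpu.cores directly.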
b
@D. Draco O'Brien thanks, I will try your suggestions. I am using RocksDB as the backend (state.backend: rocksdb). But it seems like the state size might be affecting this. What I have noticed is that with a small state size, around 10 GB, I don't seem to have issues, but as the state grows to around 60 GB I keep getting issues. I am using SSDs, so it is a bit puzzling to me. For the local state I use the state.backend.rocksdb.localdir: /mnt/flink config, but of course my checkpoints are using durable storage (ADLS Gen2).
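Put together, the storage side of that setup would look roughly like this. The state.checkpoints.dir key and the abfss:// URI layout are assumptions about how the ADLS Gen2 location is configured; the container, account, and path are placeholders.

```yaml
# Working state on local SSD, checkpoints on durable storage.
state.backend: rocksdb
state.backend.rocksdb.localdir: /mnt/flink
# Assumed ADLS Gen2 checkpoint location; fill in the real container, account, and path.
state.checkpoints.dir: abfss://<container>@<account>.dfs.core.windows.net/<path>
```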
d
If you are on K8s, StatefulSets might be more stable: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
b
I am using the apache flink operator for the deployment and operation of my job so I am not sure that would help me much. Can you explain a bit more about how it might help me?
d
I am not sure whether the Operator can use StatefulSets. You can take a look at this thread: https://www.mail-archive.com/user@flink.apache.org/msg51675.html
I think the FlinkSessionJobs or FlinkDeployments within Flink Operator can internally create and manage StatefulSets for JobManager and TaskManager pods. This might give the pods in the StatefulSet a unique identity, and more stable storage.