:wave: Our cluster default is unaligned checkpoint...
# troubleshooting
👋 Our cluster default is unaligned checkpointing. We have a job that sometimes restarts due to failing checkpoints, and when we look at those checkpoints, they are aligned. After a few failures/restarts an unaligned checkpoint is created and the job resumes working. (The job reads from Kafka, writes to a DB, and has backpressure.) We are unsure why the aligned/unaligned switch happens. Has anyone noticed similar behavior, or can anyone explain it?
Could you share more information, specifically the configurations for the Flink app?
Thank you! The job manager config? Or the job itself?
Here is the job manager config

Configurations:

blob.server.port	40925
env.java.opts.all	-XX:ActiveProcessorCount=2
execution.checkpointing.externalized-checkpoint-retention	DELETE_ON_CANCELLATION
execution.checkpointing.interval	1min
execution.checkpointing.max-concurrent-checkpoints	1
execution.checkpointing.min-pause	0
execution.checkpointing.mode	EXACTLY_ONCE
execution.checkpointing.timeout	5min
execution.checkpointing.tolerable-failed-checkpoints	0
execution.checkpointing.unaligned.enabled	true
high-availability.cluster-id	/flink-staging
high-availability.jobmanager.port	27927
high-availability.storageDir	-----
high-availability.type	zookeeper
high-availability.zookeeper.quorum	zk-1.******:2181,zk-2.******:2181,zk-3.******:2181
io.tmp.dirs	/tmp
jobmanager.archive.fs.dir	s3://flink-fsn1/completed-jobs
jobmanager.memory.flink.size	7g
jobmanager.memory.heap.size	7381975040b
jobmanager.memory.jvm-metaspace.size	536870912b
jobmanager.memory.jvm-overhead.max	894784868b
jobmanager.memory.jvm-overhead.min	894784868b
jobmanager.memory.off-heap.size	134217728b
jobmanager.rpc.address	xx.xx.xx.xxx
jobmanager.rpc.port	27927
metrics.reporter.prom.factory.class	org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.filterLabelValueCharacters	true
metrics.reporter.prom.port	27666
metrics.reporters	prom
process.working-dir	/local
rest.bind-address	xx.xx.xx.xxx
rest.bind-port	20967
rest.flamegraph.enabled	true
restart-strategy.exponential-delay.backoff-multiplier	2.0
restart-strategy.exponential-delay.initial-backoff	1sec
restart-strategy.exponential-delay.jitter-factor	0.1
restart-strategy.exponential-delay.max-backoff	1min
restart-strategy.exponential-delay.reset-backoff-threshold	5min
restart-strategy.type	exponential-delay
s3.access-key	******
s3.endpoint	******
s3.secret-key	******
state.backend.incremental	false
state.backend.local-recovery	true
state.backend.type	filesystem
state.checkpoints.dir	s3://flink-fsn1/checkpoints
state.savepoints.dir	s3://flink-fsn1/savepoints
web.cancel.enable	false
web.submit.enable	true

JVM
version	OpenJDK 64-Bit Server VM - Red Hat, Inc. - 11/11.0.20+8-LTS
arch	amd64
options	-Xmx7381975040
-Xms7381975040
-XX:MaxMetaspaceSize=536870912
-XX:ActiveProcessorCount=2
-Dlog.file=/alloc/flink/log/flink--standalonesession-0-app-1.log
-Dlog4j.configuration=file:/secrets/flink//log4j-console.properties
-Dlog4j.configurationFile=file:/secrets/flink//log4j-console.properties
-Dlogback.configurationFile=file:/secrets/flink//logback-console.xml
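Since the dump above is long, here is a minimal sketch that pulls out only the checkpoint-related keys, which are the ones relevant to the aligned/unaligned question. `CONFIG_DUMP` is a hand-copied excerpt of the listing above, and `parse_config` and `checkpoint_settings` are hypothetical helper names. The sketch assumes the key and value on each line are separated by a tab, as they appear in the paste.

```python
# Sketch: filter a tab-separated Flink config dump down to checkpoint settings.
# CONFIG_DUMP is an excerpt copied by hand from the config shared in this thread.
CONFIG_DUMP = """\
execution.checkpointing.interval\t1min
execution.checkpointing.min-pause\t0
execution.checkpointing.timeout\t5min
execution.checkpointing.tolerable-failed-checkpoints\t0
execution.checkpointing.unaligned.enabled\ttrue
restart-strategy.type\texponential-delay
state.backend.type\tfilesystem
"""


def parse_config(dump: str) -> dict[str, str]:
    """Split each non-empty line on its first tab into a key/value pair."""
    config = {}
    for line in dump.splitlines():
        if not line.strip():
            continue  # skip blank lines between sections
        key, _, value = line.partition("\t")
        config[key.strip()] = value.strip()
    return config


def checkpoint_settings(config: dict[str, str]) -> dict[str, str]:
    """Keep only the keys under the execution.checkpointing. prefix."""
    prefix = "execution.checkpointing."
    return {k: v for k, v in config.items() if k.startswith(prefix)}


if __name__ == "__main__":
    for key, value in checkpoint_settings(parse_config(CONFIG_DUMP)).items():
        print(f"{key} = {value}")
```

In the filtered view, `execution.checkpointing.tolerable-failed-checkpoints` is `0`, so a single failed checkpoint is enough to fail the job, which fits the restarts described above. It does not by itself explain the aligned/unaligned switch.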
Any idea? Also worth noting: while a checkpoint is in progress the UI shows it as aligned, but once the same checkpoint completes it shows as unaligned.
The problem persists and we don't have any leads, so we would appreciate any help. To add to the weirdness, we noticed that some checkpoints show as aligned in the UI while they are "in progress" but show as unaligned once they complete.