# troubleshooting
Hello! We're running into an issue when deploying Flink on Kubernetes using the flink-kubernetes-operator.
Occasionally, when a JobManager gets rotated out (by Karpenter, in our case), the next JobManager never reaches a stable state and is stuck in a crash loop with a DuplicateJobSubmissionException.
We did increase terminationGracePeriodSeconds, but it doesn't seem to help. Is it expected that the operator can't get JobManagers back into a stable state, or did we perhaps configure something wrong? Thanks! ⬇️ Our configuration is in the thread.
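In case it's useful, here is roughly where we bumped the grace period. This is a minimal sketch of our FlinkDeployment, not the full manifest, and the 120s value is just the ballpark we tried:

```yaml
# Sketch only: trimmed-down FlinkDeployment showing where we raised the
# JobManager pod's termination grace period (exact value is a placeholder).
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my_name
  namespace: flink
spec:
  flinkVersion: v1_19
  serviceAccount: flink
  jobManager:
    replicas: 1
    podTemplate:
      spec:
        terminationGracePeriodSeconds: 120  # raised from the 30s default; didn't help
```

The full effective configuration we end up with follows: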
kubernetes.jobmanager.replicas, 1
execution.submit-failed-job-on-application-error, true
high-availability.cluster-id, my_name
kubernetes.jobmanager.cpu.limit-factor, 10
pipeline.max-parallelism, 6
kubernetes.service-account, flink
kubernetes.cluster-id, my_name
high-availability.storageDir, s3://my_bucket/recovery
taskmanager.memory.flink.size, 1024m
parallelism.default, 1
kubernetes.namespace, flink
fs.s3a.aws.credentials.provider, com.amazonaws.auth.DefaultAWSCredentialsProviderChain
kubernetes.jobmanager.owner.reference, apiVersion:flink.apache.org/v1beta1,kind:FlinkDeployment,uid:a42c1e37-8f5e-4ec0-a04f-00000,name:my_name,controller:false,blockOwnerDeletion:true
state.backend.type, rocksdb
kubernetes.container.image.ref, 00000.dkr.ecr.eu-west-3.amazonaws.com/data/my_image_ref
jobmanager.memory.flink.size, 1024m
taskmanager.memory.process.size, 2048m
kubernetes.internal.jobmanager.entrypoint.class, org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
pipeline.name, my_name
execution.savepoint.path, s3://my_bucket/savepoints/savepoint-044d28-f5000c2e4bc0
state.backend.local-recovery, false
state.backend.rocksdb.localdir, /opt/flink/state
kubernetes.pod-template-file.taskmanager, /tmp/flink_op_generated_podTemplate_9407366845247969567.yaml
state.backend.incremental, true
web.cancel.enable, false
execution.shutdown-on-application-finish, false
job-result-store.delete-on-commit, false
$internal.pipeline.job-id, 044d28b712536c1d1feed3475f2b8111
taskmanager.memory.managed.fraction, 0.6
$internal.flink.version, v1_19
execution.checkpointing.max-concurrent-checkpoints, 1
kubernetes.pod-template-file.jobmanager, /tmp/flink_op_generated_podTemplate_834737432685891333.yaml
blob.server.port, 6102
kubernetes.jobmanager.annotations, flinkdeployment.flink.apache.org/generation:5
job-result-store.storage-path, s3://my_bucket/recovery/job-result-store/my_name/9cf5a2e7-89c6-40e7-94dd-c272a2007000
fs.allowed-fallback-filesystems, s3
high-availability.type, kubernetes
state.savepoints.dir, s3://my_bucket/savepoints
$internal.application.program-args, -pyclientexec;/usr/bin/python3;-py;/opt/flink/usrlib/my_name.py;--restoreMode;CLAIM
taskmanager.numberOfTaskSlots, 2
kubernetes.rest-service.exposed.type, ClusterIP
high-availability.jobmanager.port, 6101
process.working-dir, /tmp/workdir
$internal.application.main, org.apache.flink.client.python.PythonDriver
execution.target, kubernetes-application
jobmanager.memory.process.size, 2048m
taskmanager.rpc.port, 6100
internal.cluster.execution-mode, NORMAL
kubernetes.jobmanager.tolerations, key:dedicated,operator:Equal,value:low-churn,effect:NoSchedule
execution.checkpointing.externalized-checkpoint-retention, RETAIN_ON_CANCELLATION
pipeline.jars, local:///opt/flink/opt/flink-python-1.19.0.jar
state.checkpoints.dir, s3://my_bucket/checkpoints
jobmanager.memory.off-heap.size, 134217728b
jobmanager.memory.jvm-overhead.min, 805306368b
jobmanager.memory.jvm-metaspace.size, 268435456b
jobmanager.memory.heap.size, 939524096b
jobmanager.memory.jvm-overhead.max, 805306368b