Hi I am using flink k8s operator and running flink session c Apache Flink #troubleshooting


Akash Patel

08/29/2024, 4:30 PM

Hi, I am using the Flink k8s operator and running a Flink session cluster in HA mode. Config: high-availability.type: "KUBERNETES", high-availability.storageDir: "/some/aks/pvc/path", with replicas set to 3. I see all 3 pods coming up: one is the leader and 2 are on standby. The issue is that when I restart the leader pod, leadership does not switch to one of the available standbys. Instead, the cluster waits for the old leader to restart and reassigns the leader tag to it, which causes all task managers to restart. Does anyone know if I am missing any configuration? Flink 1.18

Arthur Catrisse

08/30/2024, 12:29 PM

Hi, sorry, I don't have a solution to your issue; hope you get an answer! I am also trying to make the Flink k8s operator work in HA, and even when we define jobmanager.replicas > 1, it is still interpreted as 1 when deployed. By any chance, did you have any issues setting this up, or more generally working with the k8s operator? We're using FlinkDeployment and did set:


high-availability.type: kubernetes
high-availability.storageDir: s3://my-bucket/recovery
kubernetes.jobmanager.replicas: "2"

(link to our other issue)

Akash Patel

08/30/2024, 1:46 PM

@Arthur Catrisse I am also using FlinkDeployment. Instead of setting replicas in the configuration, you should define it under jobManager. For example, below the flinkConfiguration:

flinkConfiguration:
  Xx.xx: "some conf"
jobManager:
  replicas: 2
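For context, here is a rough sketch of where that field might sit in a full FlinkDeployment manifest for a session cluster. The metadata name, image tag, service account, and resource sizes are placeholder assumptions and do not come from this thread; the HA keys and storage path are the ones quoted above.

```yaml
# Minimal sketch, not a tested manifest. Names and sizes are assumptions.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: session-cluster-ha          # hypothetical name
spec:
  image: flink:1.18                 # assumed image tag for Flink 1.18
  flinkVersion: v1_18
  serviceAccount: flink             # assumed service account
  flinkConfiguration:
    high-availability.type: kubernetes
    high-availability.storageDir: s3://my-bucket/recovery
  jobManager:
    replicas: 2                     # set here, not via kubernetes.jobmanager.replicas
    resource:
      memory: "2048m"               # placeholder sizing
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"               # placeholder sizing
      cpu: 1
  # No "job" section: without it the operator deploys a session cluster.
```

After applying something like this, `kubectl get pods` should show whether more than one JobManager pod actually comes up.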

Arthur Catrisse

09/04/2024, 2:34 PM

Thank you, this worked for me! (Although we still have issues with jobmanagers occasionally crash-looping.) Not sure if it's useful for you, but we got this insight:

About HA: it is beneficial to use it in production so that the job keeps track of its checkpoints and can recover from the last checkpoint automatically.

It may not be clearly stated in the Flink docs for Kubernetes HA, but in the K8s HA case there is always only 1 JM pod; NO standby replicated JM pods are created.

HA only works here as a job store: the information about the latest snapshot is passed to the newly created JM pod if the existing one dies, so that the job recovers automatically.
