# troubleshooting
g
Hi team, we have the Spotify Flink k8s Operator v0.4.2 running in production and have some concerns about HA. We run multiple Flink clusters and are seeing that when a job manager goes down for a reason unrelated to the job itself (e.g., an EKS upgrade), the entire job is killed, including its task managers, and the namespace is wiped out. Has anyone faced this issue before or knows how to mitigate/validate it?
g
The community does not support the Spotify k8s Operator. Please check https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
g
Does the Apache operator support nodeSelector or node/pod affinity rules?
We want to enable rack-aware processing to cut down on data transfer costs.
I'm looking for something like this:
```yaml
jobManager:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a
taskManager:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a
```
d
@Guruguha Marur Sreenivasa We use the community Flink Kubernetes Operator and use several nodeSelectors at the application level (the same on task manager and job manager). They work just fine. According to the docs (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/) it looks straightforward to set different ones for the task and job managers. Note the top-level `podTemplate` and then the second `podTemplate` under `taskManager`. You'd just add your node selectors in those parts.
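For example, something along these lines (untested sketch based on the pod-template docs; the image, resource, and zone values are placeholders):
```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-deployment
spec:
  image: flink:1.17            # placeholder
  flinkVersion: v1_17
  serviceAccount: flink
  # Common pod template, merged into both JobManager and TaskManager pods
  podTemplate:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
    # Role-specific template overrides/extends the common one,
    # so the task managers could be pinned differently if needed
    podTemplate:
      spec:
        nodeSelector:
          topology.kubernetes.io/zone: us-east-1a
```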
g
@David Christle thanks for responding. The problem we're trying to solve is reducing data transfer cost. We tried the nodeSelector approach above, and although it works, the problem is that if that AZ goes down, the entire application goes down. We want to add redundancy to this. Here's my thought (rough sketch of step 2 after the list):
1. In the job submitter, get the AZs of all task managers that are provisioned.
2. When creating the `KafkaSource`, use those AZs to connect to Kafka brokers in the same AZ, so that the task managers read AZ-locally.
3. This way, we don't incur data transfer costs.
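For step 2, the mechanism I have in mind is Kafka's follower fetching (KIP-392): a consumer that advertises its rack via `client.rack` reads from the closest replica, provided the brokers set `broker.rack` and `replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector`. Rough, untested sketch with placeholder topic/broker names; the open question is still how to resolve the zone per task manager rather than on the submitter:
```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RackLocalKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // NOTE: this value is captured when the job graph is built, i.e. on the
        // submitter, which is exactly the difficulty discussed in this thread.
        // It would need to be resolved per TaskManager for true AZ-local reads.
        String zone = System.getenv().getOrDefault("AWS_AVAILABILITY_ZONE", "us-east-1a");

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-bootstrap:9092")   // placeholder
                .setTopics("events")                           // placeholder
                .setGroupId("az-local-consumer")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // KIP-392: advertise this consumer's rack so it fetches from
                // the closest replica instead of always hitting the leader.
                .setProperty("client.rack", zone)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .print();

        env.execute("az-local-kafka-read");
    }
}
```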
d
@Guruguha Marur Sreenivasa Hmmm. That sounds tricky. If you want redundancy, I believe you need to operate the `JobManager` in HA mode. Otherwise, since the JM usually runs on just one pod, your job goes down if it crashes. Maybe pod topology spread constraints could help ensure the two JM pods (the active one and the standby) are scheduled in different zones. I'm not sure how to make broker selection dynamic based on zone within Flink. And even if each TaskManager could pick the brokers in its own zone, there's still a problem: there's no guarantee the TaskManager pods all land in one zone. The Kafka-to-TM traffic would stay zone-local, but the Flink application probably has some shuffle operations, so intermediate data would still cross zones and incur the cost.
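Something roughly like this under `jobManager` might do it (untested sketch; the `component: jobmanager` label is the one Flink's native Kubernetes integration normally puts on JM pods, so double-check it on your deployment):
```yaml
jobManager:
  replicas: 2   # active + standby, only meaningful with HA enabled
  podTemplate:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              component: jobmanager
```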
g
@David Christle thanks for your inputs. We do have HA enabled for our job managers. Also, we don't necessarily need the task managers to all be in the same AZ; we just want each task manager to connect to Kafka brokers in whatever AZ it happens to be in. I was trying to do this in the submitter, but the problem there is fetching the AZ info for the task managers from the submitter. Not sure how that works.
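One possible way around that: instead of the submitter resolving zones, each task manager could discover its own AZ at runtime, e.g. by querying the EC2 instance metadata service from inside the pod. Rough sketch below, assuming IMDSv2 is reachable from pods (on EKS that depends on the node's metadata hop-limit setting); an alternative would be reading the node's `topology.kubernetes.io/zone` label via the Kubernetes API, which needs the node name plus RBAC.
```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Rough sketch: discover the availability zone of the EC2 node this process runs on, via IMDSv2. */
public final class AvailabilityZoneLookup {

    private static final String IMDS = "http://169.254.169.254";

    public static String currentZone() throws IOException, InterruptedException {
        HttpClient http = HttpClient.newHttpClient();

        // Step 1: request a short-lived IMDSv2 session token.
        HttpRequest tokenRequest = HttpRequest.newBuilder(URI.create(IMDS + "/latest/api/token"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .header("X-aws-ec2-metadata-token-ttl-seconds", "21600")
                .build();
        String token = http.send(tokenRequest, HttpResponse.BodyHandlers.ofString()).body();

        // Step 2: read the node's availability zone using the token.
        HttpRequest azRequest = HttpRequest
                .newBuilder(URI.create(IMDS + "/latest/meta-data/placement/availability-zone"))
                .header("X-aws-ec2-metadata-token", token)
                .GET()
                .build();
        return http.send(azRequest, HttpResponse.BodyHandlers.ofString()).body().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Zone: " + currentZone());
    }
}
```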