hey All I have some questions regarding parallelism and max Apache Flink #random

Sunil Srinivasa

09/05/2023, 8:56 PM

Hey all, I have some questions regarding parallelism and max parallelism: 1. Is it necessary to set max parallelism from the beginning with future data growth in mind, or can it be left unset and configured later? 2. I have a job with keyed state where the keys are UUIDs. In this case, how do I determine the max parallelism for the keyed state? 3. How does parallelism work with the Kafka connector, i.e. is there any relation between the number of Kafka partitions and the parallelism?

David Anderson

09/11/2023, 9:45 PM

It's important to set the max parallelism before going into production, and to set it high enough to account for future growth. To help avoid data skew, it's good to set the max parallelism to something like 4-5 times the maximum parallelism you actually expect to use. What this setting does is determine the number of key groups that the keys will be hashed into. Key groups are assigned to task slots, with each slot processing the data for one or more key groups. The number of Kafka partitions determines the maximum number of Kafka source instances you can assign work to. Your actual parallelism can be higher than this, but if it is, some instances of the Kafka source operator will be idle. https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/production_ready/#set-an-explicit-max-parallelism provides some additional information.
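
To illustrate why a max parallelism of 4-5 times the expected parallelism helps with skew, here is a minimal, self-contained sketch that counts how many key groups land on each parallel instance. It assumes key groups are mapped to instances with the integer formula `keyGroup * parallelism / maxParallelism`, which is my understanding of what Flink does internally. The class and method names are hypothetical and not part of Flink, and the hashing of keys into key groups is left out entirely.

```java
public class KeyGroupSkewSketch {

    // Assumed mapping from a key group to a parallel instance (hypothetical helper,
    // modeled on how Flink is understood to assign key group ranges).
    static int instanceForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Counts how many key groups each parallel instance is responsible for.
    static int[] keyGroupsPerInstance(int maxParallelism, int parallelism) {
        int[] counts = new int[parallelism];
        for (int keyGroup = 0; keyGroup < maxParallelism; keyGroup++) {
            counts[instanceForKeyGroup(maxParallelism, parallelism, keyGroup)]++;
        }
        return counts;
    }

    // Ratio between the most loaded and the least loaded instance, in key groups.
    static double imbalance(int maxParallelism, int parallelism) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int count : keyGroupsPerInstance(maxParallelism, parallelism)) {
            min = Math.min(min, count);
            max = Math.max(max, count);
        }
        return (double) max / min;
    }

    public static void main(String[] args) {
        // 128 key groups over 100 instances: some instances get 2 key groups, others 1.
        System.out.println(imbalance(128, 100));
        // 512 key groups over 100 instances: every instance gets 5 or 6 key groups.
        System.out.println(imbalance(512, 100));
    }
}
```

With a max parallelism barely above the parallelism, some instances own twice as many key groups as others. With roughly five times as many key groups, the worst case in this sketch drops to 6 versus 5. This only models the assignment of key groups, not how evenly the UUID keys themselves hash into them.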

👍 1

Sunil Srinivasa

09/11/2023, 11:03 PM

I tried to increase the parallelism beyond the number of Kafka topic partitions, but the watermark didn't progress at all because of the idle tasks.

David Anderson

09/11/2023, 11:11 PM

Yes, when you do that you must configure a `withIdleness` interval on the watermark strategy.
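
To show why idle source instances stall the watermark and why marking them idle helps, here is a small plain-Java simulation. It is not Flink code. It assumes that a downstream operator takes the minimum watermark across its inputs and skips any input that has been marked idle, which is the behavior an idleness timeout on the watermark strategy is meant to enable. All names are hypothetical, and the all-inputs-idle case is simplified.

```java
public class IdleWatermarkSketch {

    // Combined watermark of a downstream operator: the minimum over its inputs.
    // When ignoreIdle is true, inputs flagged as idle do not hold the minimum back.
    static long combinedWatermark(long[] watermarks, boolean[] idle, boolean ignoreIdle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (ignoreIdle && idle[i]) {
                continue; // an idle input is excluded from the minimum
            }
            min = Math.min(min, watermarks[i]);
            anyActive = true;
        }
        // Simplification for this sketch: with no active input, report "no watermark yet".
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // Three source instances but only two Kafka partitions: the third instance
        // never sees a record, so its watermark stays at the initial minimum value.
        long[] watermarks = {1_000L, 1_200L, Long.MIN_VALUE};
        boolean[] idle = {false, false, true};

        // Without idleness handling the combined watermark never advances.
        System.out.println(combinedWatermark(watermarks, idle, false) == Long.MIN_VALUE);
        // With the idle instance excluded, the watermark follows the active instances.
        System.out.println(combinedWatermark(watermarks, idle, true));
    }
}
```

In an actual job the idleness timeout is passed as a duration when building the watermark strategy. After that much time without records, an instance is treated as idle and no longer holds back the downstream watermark.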

🙌 1
