# random
s
Hey all, I have some questions regarding parallelism and max parallelism: 1. Is it necessary to set max parallelism from the beginning with future data growth in mind, or can I leave it unset and set it later? 2. I have a job with keyed state where the keys are UUIDs. In this case, how do I determine the max parallelism for the keyed state? 3. How does parallelism work with the Kafka connector, i.e. is there any relation between Kafka partitions and parallelism?
d
It's important to set the max parallelism before going into production, and to set it high enough to account for future growth, because changing it after the job has accumulated state generally means starting over without that state. To help avoid data skew, it's good to set the max parallelism to something like 4-5 times the highest parallelism you actually expect to use. What this setting does is determine the number of key groups that the keys will be hashed into; this works the same way for UUID keys as for any other key type. Key groups are assigned to the parallel instances of your keyed operators, with each instance processing the data for one or more key groups. The number of Kafka partitions determines the maximum number of Kafka source instances you can assign work to. Your actual parallelism can be higher than this, but if it is, you will have some idle instances of the Kafka source operator. https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/production_ready/#set-an-explicit-max-parallelism provides some additional information.
👍 1
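To make the key group idea above concrete, here is a minimal sketch in plain Java with no Flink dependency. It assumes the usual two-step scheme: a key is hashed into one of `maxParallelism` key groups, and each key group is then mapped onto one of `parallelism` subtasks. The class name `KeyGroupSketch`, its method names, and the `mix` function are made up for illustration; in particular, `mix` is only a stand-in for the murmur hash Flink applies, so the concrete group numbers will not match a real job.

```java
import java.util.UUID;

/** Simplified model of key -> key group -> subtask assignment (not Flink's own code). */
class KeyGroupSketch {

    // Stand-in bit mixer (assumption): Flink uses a murmur hash of key.hashCode() here.
    static int mix(int h) {
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return h;
    }

    // Hash a key into one of maxParallelism key groups.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(mix(key.hashCode()), maxParallelism);
    }

    // Map a key group onto one of `parallelism` subtasks as a contiguous range.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 4;
        int[] keysPerSubtask = new int[parallelism];

        // Random UUID keys, as in the question; count how many land on each subtask.
        for (int i = 0; i < 100_000; i++) {
            int keyGroup = keyGroupFor(UUID.randomUUID().toString(), maxParallelism);
            keysPerSubtask[subtaskFor(keyGroup, maxParallelism, parallelism)]++;
        }
        for (int s = 0; s < parallelism; s++) {
            System.out.println("subtask " + s + ": " + keysPerSubtask[s] + " keys");
        }
    }
}
```

Under this model the skew argument is easy to see: with 128 key groups and a parallelism of 100, 28 subtasks end up with two key groups and 72 with only one, so some subtasks carry roughly twice the load. With a max parallelism several times larger than the parallelism, the difference between subtasks shrinks to a small fraction.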
s
I tried to increase the parallelism beyond the number of Kafka topic partitions, but I saw that the watermark didn't progress at all because of the idle tasks.
d
Yes, when you do that you must configure a `withIdleness` interval on the watermark strategy.
🙌 1
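As a rough illustration of why the watermark stalls in this situation, here is a small plain-Java model with no Flink dependency. It assumes the downstream watermark is the minimum over all source subtasks that are not marked idle, and that a subtask without a partition never emits a watermark. The class `IdleWatermarkSketch`, the method `combinedWatermark`, and the one-partition-per-subtask layout are all hypothetical simplifications. In a real job the corresponding setting would be something like `WatermarkStrategy.forBoundedOutOfOrderness(...).withIdleness(Duration.ofMinutes(1))`, with durations chosen for your workload.

```java
import java.util.OptionalLong;

/** Simplified model of watermark combination across source subtasks (not Flink's own code). */
class IdleWatermarkSketch {

    // Combined watermark = minimum over the subtasks that are not marked idle.
    // Returns empty if every subtask is idle.
    static OptionalLong combinedWatermark(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (idle[i]) {
                continue; // idle subtasks are ignored
            }
            anyActive = true;
            min = Math.min(min, watermarks[i]);
        }
        return anyActive ? OptionalLong.of(min) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        int partitions = 4;   // Kafka partitions in the topic
        int parallelism = 6;  // source parallelism, higher than the partition count

        long[] watermarks = new long[parallelism];
        boolean[] withoutIdleness = new boolean[parallelism]; // nobody is ever marked idle
        boolean[] withIdleness = new boolean[parallelism];

        for (int i = 0; i < parallelism; i++) {
            // Assumption for the sketch: subtasks 0..3 each read one partition, 4 and 5 read nothing.
            boolean hasPartition = i < partitions;
            watermarks[i] = hasPartition ? 1_000L + i : Long.MIN_VALUE;
            withIdleness[i] = !hasPartition;
        }

        // Stuck at Long.MIN_VALUE because the two empty subtasks hold everything back.
        System.out.println("without idleness: " + combinedWatermark(watermarks, withoutIdleness));
        // Advances to the slowest active subtask once the empty ones are marked idle.
        System.out.println("with idleness:    " + combinedWatermark(watermarks, withIdleness));
    }
}
```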