# troubleshooting
Hi All! I have a problem I am not sure the best way to solve. I have a highly parallel, high-volume, non-keyed stream that is being fed into a keyed aggregate window. The problem is that records with the same key are processed close together and the volume will overwhelm a single parallel task in a keyed window. I’d like to add a non-keyed “Reducer” upstream of my keyed window to do some pre-aggregation to reduce the load on my window. Something like a session or tumbling window with a session length of a few seconds would be great, however it seems I cannot have non-keyed windows with parallelism greater than 1. Does anyone have any suggestions for such a problem?
The best idea I have right now is to use a keyed window but to append a numeric suffix to the key between 0 and the parallelism I set:
key_0
key_2
key_5
This will at least create some distribution to not overload a single parallel operator. But I’d rather use a non-keyed stream because I think it’ll be more efficient.
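For concreteness, here is roughly what the salted-key version would look like in the Scala DataStream API (the Event case class, the count field, and the 5-second tumbling windows are placeholders for illustration, not my real job):
```scala
import java.util.concurrent.ThreadLocalRandom
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical record type; the real schema will differ.
case class Event(key: String, count: Long)

def saltedPreAggregate(events: DataStream[Event], saltBuckets: Int): DataStream[Event] =
  events
    // Attach a random salt as data so the keyBy selector itself stays deterministic.
    .map(e => (ThreadLocalRandom.current().nextInt(saltBuckets), e))
    // Stage 1: pre-aggregate on (key, salt) so a single hot key spreads over several subtasks.
    .keyBy(t => s"${t._2.key}_${t._1}")
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((a, b) => (a._1, Event(a._2.key, a._2.count + b._2.count)))
    .map(_._2)
    // Stage 2: the final aggregation on the original key now sees far fewer records per window.
    .keyBy(_.key)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((a, b) => Event(a.key, a.count + b.count))
```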
The behavior you’re describing might be a misconfiguration of the Flink Kubernetes Operator or how it is set up to watch multiple namespaces. Check that the update to the Helm chart with the additional namespaces was successfully applied to the Flink Operator’s deployment:
kubectl describe deployment <flink-operator-deployment-name>
Also check the logs carefully with DEBUG-level logging enabled:
kubectl logs <operator-pod>
You mean the parallelism being 1? It’s also documented in the docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#keyed-vs-non-keyed-windows
In case of non-keyed streams, your original stream will not be split into multiple logical streams and all the windowing logic will be performed by a single task, i.e. with parallelism of 1.
That might also be an issue. We need to confirm that the changes you are applying actually get applied to the environment.
Describing the deployment will show this. The operator pod logs may also shed more light on the problem.
Also, did you check the Flink operator’s service account and make sure the RoleBinding exists and is correct?
I’m not even at the point of dealing with deployments on EKS yet. This is the early design phase and testing is only in my IDE. When I try to set parallelism on a non-keyed stream I get the following, which seems to confirm that parallelism > 1 is illegal for non-keyed windows.
Exception in thread "main" java.lang.IllegalArgumentException: The parallelism of non parallel operator must be 1.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
at org.apache.flink.api.common.operators.util.OperatorValidationUtils.validateParallelism(OperatorValidationUtils.java:35)
at org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator.setParallelism(SingleOutputStreamOperator.java:139)
at org.apache.flink.streaming.api.scala.DataStream.setParallelism(DataStream.scala:120)
Maybe it’s not allowed because timers are not available on non-keyed operators as well.
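For reference, this is roughly the shape of code that triggers it (the types and the window assigner are just placeholders, not my real pipeline):
```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val numbers: DataStream[Long] = env.fromElements(1L, 2L, 3L)

numbers
  .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce(_ + _)
  // Non-keyed windows run as a single non-parallel task, so this throws
  // IllegalArgumentException: "The parallelism of non parallel operator must be 1."
  .setParallelism(4)
```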
Well, you’re right about the parallelism limitation.
Maybe look at pre-processing before the keyed state.
You could look at using timed (tumbling) or sliding windows before the keyed aggregation. That’s one approach.
That’s what I want to do, but I’m not able to because I can’t run with parallelism over 1.
or …
Instead of using a non-keyed reducer, you could implement a pre-processing step using keyed state. Before your main keyed aggregation, key the stream by a derived attribute that gives a reasonable distribution given your data. This could be something like hashing a timestamp to create a pseudo-bucket or using a part of the data that distributes the load well.
Do a local aggregation by using a map operation with local keyed state to perform lightweight aggregations locally within each parallel instance. The state could store intermediate aggregates per key within each parallel task.
After the local aggregation, use a rescale() or rebalance() operator to redistribute the data evenly across tasks. This effectively reshuffles the data.
Finally, apply your main keyed window aggregation. The data volume will already have been reduced by the pre-processing step.
I don’t think this will violate the parallelism restriction.
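Very roughly, something like this (the Event type, the flush threshold, and the operator itself are just a sketch of the idea, not a drop-in implementation; note the local buffer is plain JVM state and is not checkpointed):
```scala
import scala.collection.mutable
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

// Hypothetical record type for illustration.
case class Event(key: String, count: Long)

// Best-effort local pre-aggregation inside each parallel subtask.
// The buffer lives on the heap and is NOT checkpointed, so records held in it can be
// lost on failure; a real version would also flush on a timer, not only on size.
class LocalPreAggregator(maxBuffered: Int) extends RichFlatMapFunction[Event, Event] {
  private val buffer = mutable.Map.empty[String, Long]

  override def flatMap(in: Event, out: Collector[Event]): Unit = {
    buffer.update(in.key, buffer.getOrElse(in.key, 0L) + in.count)
    if (buffer.size >= maxBuffered) {
      buffer.foreach { case (k, c) => out.collect(Event(k, c)) }
      buffer.clear()
    }
  }
}

def pipeline(events: DataStream[Event]): DataStream[Event] =
  events
    .flatMap(new LocalPreAggregator(maxBuffered = 10000))
    // Spread the partially aggregated records evenly before the keyed stage.
    .rebalance
    .keyBy(_.key)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((a, b) => Event(a.key, a.count + b.count))
```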
Yeah, that’s basically the solve I posted in the first comment of this thread. Derive a key to create a distribution similar to the parallelism of the operator.
It should work but the implementation might need to be adjusted
Did you do a rescale() or rebalance() to distribute evenly across tasks?
The steps I outlined will not violate the parallelism requirement
My stream is well balanced. It uses a combination of rebalance and rescale in the appropriate places.
Yes, well, that error is just a hard no on setting parallelism above 1 for non-keyed operators.
Yeah… seems like a strange limitation.