# random
a
Hi all, are non-primary-key joins supported by Flink SQL, i.e. where the pkey and the join key have an M:N relationship? If it is supported, are there any cases that need to be considered? Also, if I have a Kafka topic partitioned by pkey, should it have only a single partition to work without issues?
m
You can join on any key you want. You need to consider that for regular joins your state size can grow indefinitely, because a streaming join might mean that an update or delete needs to be emitted downstream at any time. It’s better to evaluate whether you can use something like a temporal join instead.
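Roughly like this (a minimal, untested sketch; table and column names are made up, and the temporal variant assumes CurrencyRates is declared as a versioned table with a primary key and an event-time attribute):

```sql
-- Regular join: Flink keeps all rows of both sides in state indefinitely,
-- because any future row could still match (and retract) an old result.
SELECT o.order_id, r.rate
FROM Orders AS o
JOIN CurrencyRates AS r
  ON o.currency = r.currency;

-- Temporal join: each order is joined with the rate version valid at the
-- order's own event time, so state for outdated versions can be cleaned up.
SELECT o.order_id, r.rate
FROM Orders AS o
JOIN CurrencyRates FOR SYSTEM_TIME AS OF o.order_time AS r
  ON o.currency = r.currency;
```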
a
How will it guarantee result accuracy? Consider a case where a topic has 2 partitions and the join key is spread across both of them (because the topic is partitioned on pkey). Now, for some reason, data from partition 1 is being read slowly compared to partition 2. Consider these messages:

t1 -> update[pkey1, jkey1, val1] in partition 1
t2 -> update[pkey2, jkey1, val2] in partition 2

Now the t2 message is read first, updates the state, and produces val2. When the t1 message is read afterwards, it updates the state back to val1, while the correct output was the latest value, val2. Will this be overcome by event-time processing?
m
First of all, it depends on the type of join you want, and that in turn depends on your requirements.
Since Flink isn’t tied to Kafka, it uses a variety of mechanisms to handle this, like watermarks, idle-source handling, watermark alignment, allowed out-of-orderness, etc.
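For example, allowed out-of-orderness is expressed in the watermark definition of the source table, and idle-source handling is a config option (a sketch with made-up names; the 5-second bound and 30-second timeout are arbitrary):

```sql
CREATE TABLE events (
  pkey   STRING,
  jkey   STRING,
  val    STRING,
  msg_ts TIMESTAMP(3),
  -- Watermark lags 5 seconds behind the highest timestamp seen so far,
  -- tolerating up to 5 seconds of out-of-orderness across partitions.
  WATERMARK FOR msg_ts AS msg_ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'avro'
);

-- Idle-source handling: stops a stalled partition from holding back
-- the watermark of the whole source.
SET 'table.exec.source.idle-timeout' = '30 s';
```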
a
We are providing our users a wrapper over Flink SQL to run arbitrary SQL queries, and we are currently migrating from an in-house system, built earlier with the DataStream API, to this new system. In that system we used the join key in keyBy, and I don’t have a full understanding of how Flink SQL keys data internally. The query in question is an inner join on non-pkey attributes.
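As far as I understand it, for a regular equi-join Flink SQL also hash-partitions both inputs on the join key (the equivalent of keyBy), which can be checked in the physical plan with EXPLAIN. A made-up sketch of the query in question:

```sql
-- Inner join on a non-pkey attribute; the plan should show an exchange
-- (shuffle) on jkey for both inputs before the join operator.
EXPLAIN
SELECT a.pkey, b.pkey, a.val, b.val
FROM table_a AS a
JOIN table_b AS b
  ON a.jkey = b.jkey;
```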
m
Why not use the Table API?
It feels like you’re trying to build and abstract away patterns that are already included in Flink.
a
The new system uses the Table API to build tables and descriptors from the Avro schema and simply executes the queries (i.e. Flink SQL / Table API). We do create a watermark based on the input message timestamp for the source table if users want event-time processing.
m
So the only thing you’re interested in is understanding how the Table API/SQL works under the hood?
a
Yes, that will help me understand whether there are any special cases or things that should be considered.
m
The thing is, I don’t think it’s feasible to know all the edge cases, especially since they heavily depend on the incoming data, formats, types of joins, etc. If a join isn’t supported in your setup, Flink will throw an error. If no error is thrown yet it doesn’t work as expected, I think it’s safe to assume it’s a bug.
a
Thanks a lot. From our side, we will keep an eye out, and if we discover any discrepancy in the output, we will raise a bug on the official GitHub. Have a nice and lovely day 🙂
👍 1