# troubleshooting
a
Hey everyone, I have a use case I’m considering Flink for, but I have a feeling it might be slightly unusual. Can anyone help me get a feel for whether it’s a good idea or not? What we would like to do is use Flink to join 4 to 5 topics on a key and emit to a single topic which is easy for application developers to consume and project into local storage. The data in question puts these constraints on our solution:
- we’ll be joining probably about 4 or 5 different Kafka topics on a single key attribute
- we will have records which are “arbitrarily late arriving”, which means that if we get a record we need to keep it in state forever, so that the emitted join reflects the current state without missing attributes (e.g. a record from last month)
s
In my personal opinion, joining 4 or 5 topics is fine, but the second requirement, “arbitrarily late arriving” combined with not allowing dropped records, kills the architecture. Realistically, to handle that without dropping records, your configured watermark needs to wait out the maximum lateness. Assuming the worst case is 1 month, that means for Flink to declare that all events have been received and the watermark can advance, 1 month would already have passed (which begs the question: why not solve this at the data warehouse instead, since you’d be delaying anyway?)
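To make the delay concrete, here is a minimal sketch (plain Python, not the Flink API) of the bounded-out-of-orderness watermark idea described above: the watermark trails the maximum event time seen by the allowed lateness, so with a 1-month bound nothing before the watermark can be finalized until roughly a month of event time has gone by. The 30-day figure is the assumed worst case from this thread.

```python
from datetime import datetime, timedelta

# Assumed worst-case lateness from the discussion above.
ALLOWED_LATENESS = timedelta(days=30)

def watermark(max_event_time_seen: datetime) -> datetime:
    """Watermark = latest event time observed minus the allowed lateness.

    The engine treats everything at or before the watermark as complete,
    so a large lateness bound directly delays when results can be emitted.
    """
    return max_event_time_seen - ALLOWED_LATENESS

# Even after seeing events up to March 1, the watermark is still back
# in January, so no window ending after Jan 31 can be finalized yet.
latest = datetime(2024, 3, 1)
print(watermark(latest))  # 2024-01-31 00:00:00
```

In Flink terms this corresponds to something like `WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofDays(30))`, which is why "arbitrarily late" effectively means "wait arbitrarily long".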
a
Thanks a lot for sharing your view. It looks like I was right to be worried. But if I solved this at a data warehouse instead of in stream processing, would it then be possible to publish the joined product as a new Kafka topic? I always think of data warehouses as a terminus for data, but that might just be my misunderstanding.
Also, is it not possible to do the computation without watermarks, and instead on arrival of each record?
Apologies if my questions are naive, I’m new in this area.
s
Or maybe “solutioning at the data warehouse” is not the right phrase; maybe something like the lambda architecture, where an offline batch process does the computation (it can be on Spark, and publishing the result to Kafka is then a separate issue). For the question of watermarks: if you were to do traditional joins you would need them, because in the scenario where message A1 cannot join to any message in stream B, the engine needs to know when to give up and declare that A1 has no join, and either drop it or emit with nulls (outer join). That’s where the watermark comes into play: it tells the engine when to give up, because all messages up to that watermark in stream B have been delivered.
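For the "compute on every arrival, no watermarks" idea, here is a minimal sketch (plain Python, illustrative names, not the Flink API) of what that looks like: keep each topic's latest record per key in state forever, and on every arrival re-emit the merged view. Conceptually this is what a keyed co-process function with value state and no timers amounts to.

```python
from collections import defaultdict

# key -> {topic: latest record}; kept forever, as the requirement demands.
state = defaultdict(dict)

def on_arrival(key, topic, record):
    """Update this topic's slot for the key and emit the current joined view.

    No watermark is involved: the output is always "best known so far",
    so downstream consumers see partial joins until all topics arrive.
    """
    state[key][topic] = record
    return dict(state[key])

on_arrival("k1", "orders", {"qty": 2})
joined = on_arrival("k1", "shipments", {"eta": "friday"})
# joined now holds both topics' latest records for key "k1"
```

The tradeoffs are the ones hinted at below: nothing is ever dropped, but state grows without bound unless you add a TTL, and consumers must tolerate seeing incomplete joins that are later superseded.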
Maybe something of an alternative view:

https://www.youtube.com/watch?v=tiGxEGPyqCg

(it is technically possible but with tradeoffs)
a
Thanks!