Is there a way to set a watermark on a non-source-...
# random
h
Is there a way to set a watermark on a non-source-table? We process IoT data, and one of the issues we face is clock skew - it’s possible for our event times to be set to future times. We currently have a flow that looks like: Kafka -> flink
least(eventTime, kafkaTime) as eventTime
-> Kafka -> Flink (with watermark on the new
eventTime
) Having to go to and from Kafka adds quite some complexity and overhead to the pipeline, and we’d really like to keep the data in Flink instead. Is there a way to do this?
m
I'm wondering if it makes sense for you to check out watermark alignment, to solve it at the source level? See https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment
h
hmm, not sure how that would help - the clock skew is not caused by clocks in our control, but by the clocks of the IoT devices themselves - which means the kafka partition itself contains records that might look like:
Copy code
eventTime (Iot)     |   timestamp (kafka)
=========================================
2023-01-01T01:01:00 | 2023-01-01T01:00:00 < record from the future
2023-01-01T01:00:00 | 2023-01-01T01:00:01
2023-01-01T00:59:00 | 2023-01-01T01:00:02 < late record
Of course we could just filter the records ourselves (without relying on the watermarks) by looking at the kafka time vs event time, but we’d then lose some of the nice properties of the watermarks + windows
1
m
What do you do with late records?
h
In some cases we take them into account (what’s the last time we heard from x), in others we drop them (what’s the nr of ids with status active between 12:00 and 12:05) which is why the watermarks are so useful, in the first case it’s a top-1 over kafka time, in the second case it’s a window over the event time, which would take into account the watermark
1
r
Clock skew is a real problem, I am curious to understand if there any good solution to overcome this. Some of these devices can be on the field and the streaming system would have no idea if they ran out of battery or went through factory reset, plus most of these devices may not have a time sync service to sync time to local time. Using processing time may not give you the right result, and the event may not even indicate if the device went to a time reset at edge.