Hii Everyone We are doing an upgrade from flink 1...
# troubleshooting
r
Hii Everyone We are doing an upgrade from flink 1.14.2 to 1.16.0, but facing a weird issue where ProcessingTime timer in
ContinuousProcessingTimeTrigger
goes into infinite loop (register+fire). Upon further debugging, it seems that this PR ( 👉 https://github.com/apache/flink/pull/17106/files) is causing the problem. Full Description: We are hitting this issue in time windowing operator.
Copy code
stream.
        .windowAll(TumblingEventTimeWindowAssigner.of(Time.days(1))
        .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(10)))
        .allowedLateness(Time.days(1))
Let's say today is day
T
, then my windowState will also contain pane of
T-1
to `T`th day as well (as per allowedLateness = 1 day) As per
ContinuousProcessingTimeTrigger.registerNextFireTimestamp(. . .)
in 1.16.0,
Copy code
long nextFireTimestamp = Math.min(time + interval, window.maxTimestamp());
Now for window pane of
T-1
to `T`th day
window.maxTimestamp()
will be
T
and
time + interval
will be some
T + x
(where T + x < T + 1) So min of both expressions will always be
T
, that will become timer.getTimestamp(). Now in
InternalTimerServiceImpl.onProcessingTime(long time)
```while ((timer = processingTimeTimersQueue.peek()) != null && timer.getTimestamp() <= time) {
// triggerTarget.onProcessingTime(timer);
}```
timer.getTimestamp() =
T
time =
T + x
So this will never come out of the loop !!! Can someone help me with this on how to proceed further ?
m
Have you tried it with Flink 1.14.3 ? Because that PR has also ended up in 1.14.3, so I wouldn't expect that this a breaking change in your upgrade
r
No, we are directly upgrading from 1.14.2 to 1.16.0. Is that not right approach ? Please guide.
m
It should be fine, but I do wonder if this is really the issue. Else it would have broken everyone who upgraded to Flink 1.14.3 and I haven't heard anyone on that
r
My window is in event-time, but I am using
ContinuousProcessingTimeTrigger
. I assume this should not cause any issue. Please guide if you have a different opinion.
m
It would be great if you can first validate if this also breaks in 1.14.3. Because if it doesn’t, it’s a different issue
r
I am able to reproduce the issue in 1.16.0. I tried with a fresh job (without any savepoint). In below combination, we are hitting the infinite loop issue. window assigner: event-time based window assigner trigger:
ContinuousProcessingTimeTrigger
flink version: 1.16.0 (or any version which has this PR #17106) By further debugging, I found that using event-time based assigner with procTime trigger is causing the issue.
m
Okay, so can I assume that you can’t run a validation on 1.14.3?
r
Same can be reproduced in unit test as well. File: ContinuousProcessingTimeTriggerTest.java test name: testWindowFiring() If we change the window assigner here, to TumblingEventTimeWindows, then it will run into infinite loop. I can test the same on 1.14.3 as well (by doing job deployment), if unit test validation is not sufficient.
Just re-highlighting, I am using event-time assigner with processing-time trigger.
We are able to reproduce the issue in 1.14.3 with unit test as well as job deployment on cluster.
👀 1