# troubleshooting
s
👋 Hello, team! I'm an engineer at Yelp; we have a database connector based on Flink, and I'm having issues with the window operator. One of the 40 sub-tasks of the TumblingProcessingTimeWindows operator accumulated windows for over a day. The next restart of the job caused it to process the accumulated windows, which caused the checkpointing to time out. Once the sub-task has processed the old windows (which might take several hours) it works normally again. Could you please come up with ideas about what might cause a window operator sub-task to accumulate old windows for days until the next restart? I created a Jira ticket with all the details and investigation results: https://issues.apache.org/jira/browse/FLINK-35899
d
Looking at the JIRA ticket and the size of the 17th subtask, it looks like there might be some state imbalance
this could be due to uneven distribution of keys (doc ids) across parallel tasks
the task manager failure and restart seem to coincide with the start of the issue; maybe the task was left in an inconsistent state …
Did you perform a key distribution analysis? You might need to implement a custom partitioner to ensure a more even distribution across tasks
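a quick sketch (hypothetical class and sample data, assuming your key is the doc id and Flink's default max parallelism) that reuses Flink's own key-group assignment to see which of the 40 subtasks each key lands on:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyDistributionCheck {
    public static void main(String[] args) {
        // Hypothetical sample of doc ids pulled from your input.
        List<String> docIds = List.of("doc-1", "doc-2", "doc-3" /* , ... */);

        int parallelism = 40;      // parallelism of the window operator
        int maxParallelism = 128;  // Flink's default for this parallelism unless set explicitly

        // Count how many keys Flink would route to each subtask index.
        Map<Integer, Integer> keysPerSubtask = new HashMap<>();
        for (String docId : docIds) {
            int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                    docId, maxParallelism, parallelism);
            keysPerSubtask.merge(subtask, 1, Integer::sum);
        }

        keysPerSubtask.forEach((subtask, count) ->
                System.out.printf("subtask %d -> %d keys%n", subtask, count));
    }
}
```
the key objects have to be exactly what your keyBy emits, otherwise the hash (and therefore the subtask index) won't match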
You might also look at RocksDB tuning, i.e. buffer sizes, cache sizes, etc., to make sure they are optimized
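if you set the backend in code, a minimal sketch (assuming the EmbeddedRocksDBStateBackend; the predefined profile is just a starting point, and most knobs live in flink-conf.yaml):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints keep checkpoint sizes manageable with large window state.
        EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true);

        // Predefined profile with larger write buffers / block cache than the defaults.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);

        env.setStateBackend(backend);
    }
}
```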
also review your shutdown strategy a bit, i.e. how the task managers get shut down or terminated during restarts
s
I thought of uneven distribution, but why was the job accumulating windows? If one sub-task got more messages to process, that should slow it down and cause a checkpointing failure earlier
d
that's a good point. I think you can also look at processing vs. event time
it's crucial to check that the system clock on the machine running the affected subtask remains accurate through the window processing
a drift or inconsistency in processing time can disrupt the window boundaries and result in misalignment
These are somewhat rare but could cause this
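one way to make that visible (a sketch with hypothetical types, assuming you can drop a ProcessWindowFunction onto the affected operator) is to log how far behind the window end each firing actually happens, per subtask:

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// IN is your element type; the String key type here is hypothetical.
public class FiringLagLogger<IN> extends ProcessWindowFunction<IN, IN, String, TimeWindow> {

    private static final Logger LOG = LoggerFactory.getLogger(FiringLagLogger.class);

    @Override
    public void process(String key, Context ctx, Iterable<IN> elements, Collector<IN> out) {
        // How long after the window's end did this firing actually happen?
        long lagMillis = ctx.currentProcessingTime() - ctx.window().getEnd();
        LOG.info("subtask {} fired window {} with lag {} ms",
                getRuntimeContext().getIndexOfThisSubtask(), ctx.window(), lagMillis);
        for (IN element : elements) {
            out.collect(element);
        }
    }
}
```
for processing-time windows the lag should stay close to zero on a healthy subtask; on the stuck one you'd expect it to keep growing until the backlog drains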
Another thing you could look at is resource starvation: whether the task manager failure and restart left the subtask in a state where it could not access sufficient resources to catch up effectively. I think in a way this is more likely, and you might look at the resource settings.
I think it's either that or something related to custom timer firings, due to something like an incorrect timer setup, missed signals, or interference with a custom trigger.
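if you want to rule out the trigger/timer path, a small logging trigger (a sketch that mirrors Flink's built-in ProcessingTimeTrigger, the name is made up) shows whether timers actually get registered and fired per window:

```java
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Same behaviour as ProcessingTimeTrigger, plus logging of timer registration and firing.
public class LoggingProcessingTimeTrigger extends Trigger<Object, TimeWindow> {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingProcessingTimeTrigger.class);

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        LOG.debug("registering processing-time timer for window {}", window);
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        LOG.info("firing window {} at processing time {}", window, time);
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}
```
wire it in with .trigger(new LoggingProcessingTimeTrigger()) on the windowed stream; registrations without matching firings on the affected subtask would point at the timer service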
Other than that, all I can think of is checkpointing kicking in while the operator is overwhelmed. We may need to look closer at how Flink checkpointing interacts with this stateful operator; if the system tries to checkpoint while the operator is overwhelmed, it could cause something like this.
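that won't explain the accumulation, but relaxing the checkpoint settings (a sketch; the values are placeholders to tune for your job) can at least keep the job alive while the backlog drains instead of dying on the checkpoint timeout:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute.
        env.enableCheckpointing(60_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Give slow subtasks more time before the checkpoint is declared failed.
        checkpointConfig.setCheckpointTimeout(30 * 60 * 1000);

        // Don't start a new checkpoint immediately after the previous one finishes.
        checkpointConfig.setMinPauseBetweenCheckpoints(60_000);

        // Tolerate a few failed checkpoints instead of failing the whole job.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // Unaligned checkpoints let barriers overtake buffered records under backpressure.
        checkpointConfig.enableUnalignedCheckpoints();
    }
}
```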
Let me take a look at the logs …
Ok, so you faced a TaskManager disconnection. Looks like it crashed or was terminated. This in turn caused a failure of the sink task. The checkpoint was completed before the failure.
The sub-task might resume processing from the last successful checkpoint but fail to recognize or properly process the accumulated windows that were supposed to be handled between the last checkpoint and the failure. This could stem from incomplete state restoration, misalignment of timers, or issues in the state backend. Given the scale of the problem, it's possible that during recovery the system did not properly manage the resumption of processing-time timers or the cleanup of outdated state, leading to the backlog of unprocessed windows.
let me check logs for the recovery …
also check that the task managers and the job manager have their clocks synced
clock skew can cause this by making tasks misinterpret when windows should close
what is allowed lateness on the windows? does it seem reasonable or quite long?
s
Once the incident happened, I tried to restore the job from earlier checkpoints and savepoints. I confirmed that the issue has an accumulating effect: the savepoints from 6 hours and 1 day ago were affected, while the savepoint from 2 days ago was clean
> what is allowed lateness on the windows? does it seem reasonable or quite long?
where could I check it?
d
also I would carefully check the job's operational timeline between the “clean” savepoint and the first affected one. could be a maintenance update etc
look for any changes in input data patterns, config updates etc
check for late events, out-of-order events, or a surge in events
the allowed lateness should be in the window definitions somewhere for example:
```java
stream
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(30))) // window definition
    .allowedLateness(Time.minutes(5)) // allowing events to be late by 5 minutes
    .process(new YourWindowFunction());
```
You might need to try reducing the allowed lateness, as too long an allowed lateness can cause accumulation
search code and config files for `allowedLateness`
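and if any part of the pipeline is on event time, you can make late events visible instead of silently dropping them. a sketch extending the snippet above (Event, getDocId, Result and YourWindowFunction are placeholders):

```java
// extra import needed: org.apache.flink.util.OutputTag
// anonymous subclass so the OutputTag keeps its element type information
OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Result> results = stream
    .keyBy(event -> event.getDocId())                        // hypothetical key selector
    .window(TumblingEventTimeWindows.of(Time.seconds(30)))
    .allowedLateness(Time.minutes(5))
    .sideOutputLateData(lateTag)                             // capture events arriving after the allowed lateness
    .process(new YourWindowFunction());

// count or log these to see whether late data is actually showing up
DataStream<Event> lateEvents = results.getSideOutput(lateTag);
```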
Another thing you can do is try using Flink’s State Processor API to compare the clean savepoint’s state with the first one that has the issue.
if it's available for your version of Flink
Another little-known tool for examining state (which is a bit of a black box) is Queryable State. It has been deprecated, but you could still use it in testing or dev; sometimes it's needed to debug issues like the one you are describing, to understand what is going on.
It's not well documented, but it can be used.

https://www.youtube.com/watch?v=8qp8BmnMxVk

In summary, I think we are looking at either a clock/timer configuration issue or skew, or there is an unusual network condition or data volume that periodically occurs and that the window settings/timeouts etc. are not adjusted for. I think you need to examine all events and event flows in the timeframe between the last clean savepoint and the incident to look for patterns.