Flink restored from empty state (Apache Flink, #troubleshooting)

Kimmo Sääskilahti

09/28/2023, 11:52 AM

Flink restored from empty state

Hi everyone! We're running Flink 1.13 in Amazon Managed Apache Flink. We have managed snapshots (Flink savepoints) enabled and we're using `RESTORE_FROM_LATEST_SNAPSHOT` when updating our ETL job. We don't have any checkpoints enabled at the moment or any scheduled snapshotting.

Today we made a change where we removed one operator from a job. The first deployment failed, probably because we did not have AllowNonRestoredState enabled. So we tried again, this time with AllowNonRestoredState enabled. Flink started fine, but for some reason our operators seem to have lost state. For example, an operator counting daily events started counting again from zero. This operator was not changed in the update.

Questions to gurus:
• Is it possible to see somehow in Amazon Managed Apache Flink which operators start from which state?
• Do you have any idea why we lost state? Could it be related to the failure in the first deployment?
• Could we have avoided this scenario somehow by using scheduled snapshots or checkpointing?

I very much appreciate any comments! And sorry for the n00b questions, I'm just starting with Flink as our team inherited the application from an engineer who left the company 😬

Hong Teoh

09/28/2023, 11:54 AM

Did you set the UID of each operator? That’s the tag that Flink uses to match state from the snapshot to each operator on restore. It could be that the UIDs changed after you edited the job graph.
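To make the matching step concrete, here is a minimal toy model in Python. It is not Flink code: the `restore` function, the operator IDs, and the state values are all hypothetical. It only illustrates the documented behavior that saved state is mapped to operators by operator ID, that a strict restore fails when some saved state has no matching operator, and that allowing non-restored state skips that state, so any operator without a match starts empty.

```python
def restore(snapshot, operator_ids, allow_non_restored_state=False):
    """Toy model of a restore: match saved state to operators by ID."""
    unmatched = sorted(set(snapshot) - set(operator_ids))
    if unmatched and not allow_non_restored_state:
        raise RuntimeError(f"snapshot state has no matching operator: {unmatched}")
    # Unmatched snapshot entries are dropped; operators without a match start empty.
    return {op_id: snapshot.get(op_id) for op_id in operator_ids}


# Hypothetical snapshot: one operator with an explicit ID, two with generated IDs.
snapshot = {
    "parse-events": {"shard_position": 1200},
    "generated-7f3a": {"daily_count": 5321},
    "generated-91bc": {"rows_seen": 17},
}

# Hypothetical new job: one operator was removed, and the surviving counter
# now carries a different generated ID than the one stored in the snapshot.
new_job = ["parse-events", "generated-c04d"]

try:
    restore(snapshot, new_job)
    strict_failed = False
except RuntimeError:
    strict_failed = True

restored = restore(snapshot, new_job, allow_non_restored_state=True)
print(strict_failed)
print(restored)
```

In this sketch the second restore succeeds, but the counter under its new ID gets no state, which would look like the "counting again from zero" symptom described above.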

Kimmo Sääskilahti

09/28/2023, 11:56 AM

Hmm good question! All operators using the DataStream API have `uid` set, but the operator in question uses Flink Table API 🤔

Hong Teoh

09/28/2023, 1:32 PM

Hmm, I think the Flink Table API does not give very good guarantees on state compatibility across job graph changes

Martijn Visser

09/28/2023, 2:26 PM

It all depends on the change that's made in the query if it's indeed compatible or not. It's documented at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/upgrading/#table-api--sql

Martijn Visser

09/28/2023, 2:27 PM

There is FLIP-190, which tackles upgrades between minor versions: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191336489 and future improvements are planned, as highlighted in the roadmap: https://flink.apache.org/roadmap/#going-beyond-a-sql-streambatch-processing-engine

👍 1

Kimmo Sääskilahti

09/29/2023, 6:25 AM

> It all depends on the change that's made in the query if it's indeed compatible or not.

Thanks for the comments! We did not make any changes to the query in question. Our pipeline has a single event source that reads and parses events from Kinesis. These events are then forwarded to six branches using five side outputs. One of the side outputs feeds the SQL query in question. Another side output fed a different query, which we removed. The query we kept is the one that failed to restore its state. Could changes like this elsewhere in the job graph also cause state compatibility problems?

Martijn Visser

09/29/2023, 1:32 PM

Yes, because when you use SQL/Table API, the job graph gets generated for you. When you change any SQL/Table API code in your app, the generated job graph can change

👍 1
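A rough way to picture why an unchanged query can still end up with a different operator ID: when no explicit ID is set, the ID is derived from the structure of the job graph. The Python sketch below is a deliberately simplified model and not Flink's actual hashing algorithm. The `operator_ids` function, the operator names, and the rule "hash the position plus the upstream IDs" are assumptions made for illustration only.

```python
import hashlib


def operator_ids(operators):
    """Toy ID assignment. `operators` is a list of (name, inputs, uid) in topological order."""
    ids = {}
    for position, (name, inputs, uid) in enumerate(operators):
        digest = hashlib.sha1()
        if uid is not None:
            # An explicitly pinned ID does not depend on the rest of the graph.
            digest.update(uid.encode())
        else:
            # A generated ID depends on where the operator sits in the graph.
            digest.update(str(position).encode())
            for upstream in inputs:
                digest.update(ids[upstream].encode())
        ids[name] = digest.hexdigest()[:8]
    return ids


# Hypothetical job before the change: two generated queries behind one source.
before = [
    ("source", [], "kinesis-source"),
    ("removed_query", ["source"], None),
    ("daily_count_query", ["source"], None),
]
# After the change: one query removed, the other left untouched.
after = [
    ("source", [], "kinesis-source"),
    ("daily_count_query", ["source"], None),
]

ids_before = operator_ids(before)
ids_after = operator_ids(after)
print(ids_before["source"] == ids_after["source"])
print(ids_before["daily_count_query"] == ids_after["daily_count_query"])
```

In this model the source keeps its ID because it is pinned, while the untouched query gets a new ID simply because a sibling branch disappeared and its position shifted. State saved under the old ID then has nothing to attach to.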
