Kimmo Sääskilahti
09/28/2023, 11:52 AM
RESTORE_FROM_LATEST_SNAPSHOT when updating our ETL job. We don't have any checkpoints enabled at the moment or any scheduled snapshotting.
Today we made a change where we removed one operator from a job. The first deployment failed, probably because we did not have AllowNonRestoredState enabled. So we tried again, this time with AllowNonRestoredState enabled. Flink started fine, but for some reason our operators seem to have lost state. For example, an operator counting daily events started counting again from zero. This operator was not changed in the update.
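A toy model in plain Python (not Flink code) may help frame the two deployments described above. It assumes snapshot state is matched to operators purely by operator ID, which is how Flink's savepoint documentation describes the mapping. The function `restore`, the operator IDs, and the counts are all hypothetical.

```python
class RestoreError(Exception):
    """Raised when snapshot state cannot be mapped onto the new job."""


def restore(snapshot, job_operator_ids, allow_non_restored_state=False):
    """Toy restore: match snapshot state to operators purely by operator ID.

    snapshot: dict mapping operator ID -> saved state
    job_operator_ids: operator IDs present in the new job
    Returns a dict mapping each new operator ID to its restored state,
    or None when the snapshot holds nothing under that ID.
    """
    orphaned = sorted(set(snapshot) - set(job_operator_ids))
    if orphaned and not allow_non_restored_state:
        raise RestoreError(f"no operator for snapshot state: {orphaned}")
    return {op_id: snapshot.get(op_id) for op_id in job_operator_ids}


# Hypothetical snapshot taken before the update.
snapshot = {"daily-count": 42, "removed-query": 7}

# Case 1: one operator removed, flag off -> the restore is rejected.
try:
    restore(snapshot, ["daily-count"])
    first_attempt = "restored"
except RestoreError:
    first_attempt = "failed"

# Case 2: flag on and IDs stable -> the surviving operator keeps its state.
stable = restore(snapshot, ["daily-count"], allow_non_restored_state=True)

# Case 3: flag on, but the unchanged operator now has a different ID ->
# no error is raised and the operator silently starts with no state.
drifted = restore(snapshot, ["daily-count-regenerated"],
                  allow_non_restored_state=True)
```

Case 3 is one possible explanation for a counter restarting from zero without any error: with the flag enabled, state that cannot be matched is skipped rather than reported. Whether that is what happened here is an assumption.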
Questions to gurus:
• Is it possible to see somehow in Amazon Managed Apache Flink which operators start from which state?
• Do you have any idea why we lost state? Could it be related to the failure in the first deployment?
• Could we have avoided this scenario somehow by using scheduled snapshots or checkpointing?
I very much appreciate any comments! And sorry for the n00b questions, I'm just starting with Flink as our team inherited the application from an engineer who left the company 😬
Hong Teoh
09/28/2023, 11:54 AM
Kimmo Sääskilahti
09/28/2023, 11:56 AM
uid set, but the operator in question uses Flink Table API 🤔
Hong Teoh
09/28/2023, 1:32 PM
Martijn Visser
09/28/2023, 2:26 PM
Martijn Visser
09/28/2023, 2:27 PM
Kimmo Sääskilahti
09/29/2023, 6:25 AM
"It all depends on the change that's made in the query if it's indeed compatible or not."
Thanks for the comments! We did not make any changes to the query in question. Our pipeline has a single event source that reads and parses events from Kinesis. These events are then forwarded to six branches using five side outputs. One of the side outputs feeds the SQL query in question. Another side output fed a different query, which we removed. The query we kept is the one that failed to restore its state. Could changes like this elsewhere in the job graph also cause state compatibility problems?
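To make the question concrete: Flink's documentation notes that auto-generated operator IDs depend on the structure of the program and are sensitive to program changes. The sketch below is a toy illustration of that idea in plain Python. The hashing scheme, the function `generate_ids`, and the operator names and kinds are made up for illustration and are not Flink's actual algorithm.

```python
import hashlib


def generate_ids(operators):
    """Toy ID generator, NOT Flink's real algorithm.

    operators: list of (name, kind, [upstream names]) in topological order.
    Each ID hashes the operator's position in the traversal, its kind, and
    the IDs of its inputs, so the ID depends on the surrounding graph and
    not only on the operator itself.
    """
    ids = {}
    for position, (name, kind, inputs) in enumerate(operators):
        upstream = ",".join(ids[i] for i in inputs)
        material = f"{position}|{kind}|{upstream}"
        ids[name] = hashlib.sha256(material.encode()).hexdigest()[:8]
    return ids


# Hypothetical job graph before the update: two queries on sibling branches.
before = generate_ids([
    ("source", "kinesis-source", []),
    ("removed_query", "sql-query-a", ["source"]),
    ("daily_count", "sql-query-b", ["source"]),
])

# The same graph after removing one sibling branch.
after = generate_ids([
    ("source", "kinesis-source", []),
    ("daily_count", "sql-query-b", ["source"]),
])

source_unchanged = before["source"] == after["source"]
daily_count_changed = before["daily_count"] != after["daily_count"]
```

In this toy, `daily_count` itself is untouched, yet its generated ID changes because an operator visited earlier in the traversal was removed. If something similar applies to the real job, state saved under the old ID would no longer match the operator after the update.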
Martijn Visser
09/29/2023, 1:32 PM