Hello, I'm looking to migrate a running job (A) from a standalone deployment to a
native Kubernetes deployment. However, the job broke with multiple errors, such as "akka.framesize" being too small and the Java heap running out of space. I managed to get job A back to a steady running state on standalone, but I'm at a standstill and worried that if I try again, the job will break again.
The job has fairly large checkpoint state (~10GB), and I cannot afford to dump or clean that state. I noticed that the
docs say not to upgrade Flink and the Kafka connector at the same time. I missed this and upgraded both together. I'm not sure whether this is the cause of the issue, but I'm mentioning it as a possibility.
My question:
• What are the best steps to take to ensure a safe upgrade without the job going down again?