02/17/2022, 11:59 AM
Hello, does the migration to the mandatory intermediate version
take long (upgrading from
)? Doing it in GKE and it's been ~30 mins. CPU/mem are being utilized but I'm not seeing anything in the logs, so I'm wondering. Data volume is not that high. Row counts from tables (configs: 230, jobs: ~8,000, attempts: ~100,000).


02/17/2022, 12:09 PM
From a Slack search it looks like it might take an hour. Wowza, I will wait and update here.
1 hr 20 mins and counting. Should I bump up memory and/or drop old data somewhere to make this faster? Not sure where the time is being spent. The pod is running with 1 CPU and 6 GB memory; CPU is barely used and memory is almost fully used.
Ok, doubled the pod memory and running again. 🤞
Hmm, 1 hr in and memory keeps creeping up
Logs, if it's helpful
That migration dir mentioned in the log is inside the container
Config, Jobs, Attempts table sizes from the db -> 560 kB, 74 MB, 262 MB
Have 0 sync_state records in the db; I assume that's the migration step that is taking long there
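For anyone following along, the table sizes and the empty sync_state table can be checked with something like this (a sketch against the Airbyte Postgres database; the `jobs`, `attempts`, and `sync_state` table names are assumed from the messages above):

```sql
-- Approximate on-disk size of the tables mentioned above
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('jobs', 'attempts');

-- How many rows the migration has produced so far (0 here means it
-- hasn't started or finished populating sync_state yet)
SELECT count(*) FROM sync_state;
```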
Hi @Pras, no, it should not take much time. But I haven't performed it on EKS or GKE. I have sometimes seen that Docker doesn't start properly, so you can delete the node and try recreating it.

Akshay Saini

02/17/2022, 3:38 PM
Ok, that should not be an issue on GKE since it auto-repairs nodes with such problems.
Anyway, it did not do anything for 2 hrs so I rolled it back. I'll have to figure out a better way to upgrade later, I guess.
If there is interest, I captured jstack output before killing it.
@Pras it shouldn't take so long; the patch version runs the migration in smaller batches to avoid OOM errors during the process. If you have a huge sync history and a lot of connections, it can probably take longer 🙂
As you showed in your previous post, you have 100k attempts. If you can delete them, it will help reduce the migration time a lot.

Justin Reynolds

02/18/2022, 1:42 AM
How can I safely delete them, while sync_state is not yet created/migrated, without losing the last cursor state and forcing a full reset? Any recommendations?


02/18/2022, 11:12 AM
Just as an update: luckily we onboard all connections with a unique namespaceFormat on our side, so we used it to keep only the latest succeeded job per connection and dropped the others, then dropped all attempts whose job_id no longer exists. That brought the row count down to ~60 and the migration finished in seconds.
Then hit another problem retrieving the spec from a custom connector. It was because of taints/tolerations: the spec job pod for that version was not adding tolerations, I think, so it was unschedulable. Ended up creating a temp node pool with just the labels and no taints and got through it.
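The cleanup described above could be sketched roughly like this (an assumption-heavy sketch, not the exact statements used: it assumes Airbyte's jobs database with `jobs.scope` identifying the connection and `jobs.status = 'succeeded'` marking completed syncs; back up the database before running anything like it):

```sql
-- Keep only the most recent succeeded job per connection (scope),
-- then drop attempts whose parent job no longer exists.
BEGIN;

DELETE FROM jobs j
WHERE j.id NOT IN (
    SELECT max(id)
    FROM jobs
    WHERE status = 'succeeded'
    GROUP BY scope
);

DELETE FROM attempts a
WHERE NOT EXISTS (
    SELECT 1 FROM jobs j WHERE j.id = a.job_id
);

COMMIT;
```

Keeping the latest succeeded job per connection is what preserves the last cursor state, since the pre-sync_state migration reads state from that job's output.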