Ben Parafina (grimaldi)

04/13/2023, 8:49 PM
just found issue 1571

Dylan Page

04/13/2023, 9:45 PM
There are some ideas floating around, nothing great. This is known among some of the heavy users of the project, and hopefully we'll be able to come to a consensus on how to approach it.

Ben Parafina (grimaldi)

04/14/2023, 11:57 AM
Yeah, I'm driving it as the main tf runner for my org, and after going through the latest release I was excited to see the Redis impl, but I'm struggling now to justify it as a long-term tenant in our infra. I want to use it as part of our DR strategy, which would imply needing to scale the shit out of it in a pinch.

Bruno Schaatsbergen

04/18/2023, 8:59 PM
I've always been curious why you want to run Atlantis HA though, and how it fits into the DR strategy of an org - and whether it's worth the added complexity for what you get. Convince me!

Ben Parafina (grimaldi)

04/18/2023, 9:05 PM
Namely this: because we run it in k8s, vertically scaling a pod requires a pod restart, which means we would have to know ahead of time the approximate resource cost of re-planning and applying the org's resources. Working that out takes a nontrivial amount of time, and in a DR scenario time is of the essence. Ideally I’d like to run a fleet of horizontally scalable pods that I can connect to KEDA/Karpenter to dynamically allocate resources based on incoming plan requests.
This way I don’t have to be concerned with rightsizing my cluster from the outset and can let the incoming work dictate the footprint of my cluster.
Being able to scale my cluster horizontally rather than vertically also lets me run a cheaper fleet overall for most day-to-day operations.
I can run smaller underlying k8s nodes as spot instances, which is nice for keeping the monthlies low.
Now, running the entire graph at once is a fool's errand, I’ll grant that. But making it so that a change to the root terragrunt.hcl in my project doesn’t consume all resources and block any other potential work would be nice, and having additional nodes be on-demand rather than in a single instance would be even better.
Combine this central resource requirement, which is quite large, with the added pressure of needing to onboard other groups to the platform, and I find myself looking for ways to let Atlantis (with k8s help) be mostly self-managing if possible. My team also manages a number of other tools, as well as our customer-facing verticals.