Pardeep Bhatt

04/05/2023, 5:54 AM
Hi Team! We are trying to run Atlantis on multiple nodes by mounting the ~/.atlantis directory over NFS on every node. This works: if a plan request lands on one node and the apply request lands on another, the apply runs fine because ~/.atlantis is synced between the nodes. But it comes with a new problem. When we make simultaneous plan requests and they land on the same node, we get the error
Plan Error
The default workspace at path . is currently locked by another command that is running for this pull request.
Wait until the previous command is complete and try again.
which is fine and expected. But when the two requests land on different nodes, one of three things happens:
1. The first plan passes and the second plan fails with a failed to read plan/state file error.
2. The first plan passes and the second plan fails with an unable to get lock error.
3. Both plans pass.
What we want is to somehow return this error
Plan Error
The default workspace at path . is currently locked by another command that is running for this pull request.
Wait until the previous command is complete and try again.
to the user when multiple plan requests are made. That would be fine and we would be good to go ahead. Any sort of help will be appreciated. Thanks.

PePe Amengual

04/05/2023, 3:53 PM
you are in uncharted territory
Atlantis was not designed for this
so you will have some work to do
do you have Redis locking enabled?

Pardeep Bhatt

04/06/2023, 5:56 AM
nope not enabled.
what is that for?
oh got it. https://www.runatlantis.io/docs/server-configuration.html#locking-db-type Atlantis will write the locks on MRs to an external db, cmiiw,
and then we can connect to the same db from another Atlantis node. this way i guess we can achieve what we want.

PePe Amengual

04/06/2023, 6:01 AM
let us know how it goes

Pardeep Bhatt

04/06/2023, 6:01 AM
sure, i will.
Hi PePe, we went ahead with Redis and it helped in one case: when multiple MRs refer to the same file path, that is blocked and we get the error
Plan Failed: This project is currently locked by an unapplied plan from pull !XX. To continue, delete the lock from !XX or apply that plan and merge the pull request.
Once the lock is released, comment atlantis plan here to re-plan.
so this is fine. But when multiple plan/apply requests are fired on the same MR without waiting for the response to the first one, there are two possible scenarios:
1. all requests land on the same node
2. the requests land on different nodes
For case 1 it is expected that we get the workspace lock error, i.e.
Plan Error
The default workspace at path . is currently locked by another command that is running for this pull request.
Wait until the previous command is complete and try again.
and this is happening as expected. In case 2 it is not happening, because from what we found in the code, this information is stored in application memory and not in any database. To make it available to other nodes, it needs to be stored in some common shared storage location; Redis could be used again here.
This adds 2 I/O operations to a single plan/apply request. First, a record that a plan for this path is about to start is written to Redis, like we are doing here but written to Redis instead of application memory. Then the record is removed from Redis once the plan/apply operation has finished, i.e. inside the unlock fn.
What do you think of this approach? If you have something else in mind, please let me know. We have a hard requirement to run Atlantis on multiple nodes, because a single node can't handle the traffic we have.
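The difference between the two cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Atlantis code: InMemoryLocker, SharedLocker, FakeSharedStore and workdir_key are hypothetical names, and FakeSharedStore is an in-process stand-in that mimics the "set only if the key does not exist" behaviour of Redis SET with NX.

```python
import threading
import uuid


class InMemoryLocker:
    """Per-process lock table: each node has its own copy (hypothetical)."""

    def __init__(self):
        self._held = set()
        self._mu = threading.Lock()

    def try_lock(self, key):
        with self._mu:
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def unlock(self, key):
        with self._mu:
            self._held.discard(key)


class FakeSharedStore:
    """In-process stand-in for a shared store such as Redis (assumption)."""

    def __init__(self):
        self._data = {}
        self._mu = threading.Lock()

    def set_if_absent(self, key, value):
        # Mimics the semantics of Redis `SET key value NX`.
        with self._mu:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete_if_equals(self, key, value):
        # Only the holder of the matching token may remove the key.
        with self._mu:
            if self._data.get(key) == value:
                del self._data[key]
                return True
            return False


class SharedLocker:
    """Lock backed by the shared store: one write to lock, one to unlock."""

    def __init__(self, store):
        self._store = store
        self._tokens = {}

    def try_lock(self, key):
        token = uuid.uuid4().hex
        if self._store.set_if_absent(key, token):
            self._tokens[key] = token
            return True
        return False

    def unlock(self, key):
        token = self._tokens.pop(key, None)
        if token is not None:
            self._store.delete_if_equals(key, token)


def workdir_key(repo, pull, workspace, path):
    # Hypothetical key layout; the real key format is not given in this thread.
    return f"{repo}/{pull}/{workspace}/{path}"


key = workdir_key("org/repo", 42, "default", ".")

# Case 2 today: each node checks only its own memory, so both "win".
node_a, node_b = InMemoryLocker(), InMemoryLocker()
both_in_memory = (node_a.try_lock(key), node_b.try_lock(key))

# Proposed: both nodes consult the same store, so the second request is refused.
store = FakeSharedStore()
shared_a, shared_b = SharedLocker(store), SharedLocker(store)
first = shared_a.try_lock(key)
second = shared_b.try_lock(key)
shared_a.unlock(key)
after_unlock = shared_b.try_lock(key)
```

With a real Redis the same two operations would be one SET with NX and one delete. A production version would probably also want an expiry on the key (Redis SET accepts EX/PX together with NX), so that a node that crashes mid-plan does not hold the lock forever.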

PePe Amengual

04/12/2023, 12:29 PM
@Nish Krishnan do you have any ideas?
that could work but I’m not an expert on that part of the code. Lyft forked Atlantis and added Temporal as a backend to manage its worker queue, to solve some issues related to what you are talking about

Nish Krishnan

04/12/2023, 3:51 PM
im also not super familiar with how the redis backend was integrated, but what you’re suggesting seems like it would work? But yeah, as PePe said, we forked Atlantis and rebuilt it to work in a multi-node setup using Temporal workflows. We’ve removed a lot of the features, however, to simplify the code we’re maintaining. If you guys are running a similar setup and are interested in hearing more about this or trying it out, im open to chatting.
🙌 1
👍 1

Pardeep Bhatt

04/21/2023, 5:52 AM
FYI we have achieved the multi-node setup of Atlantis by making the changes suggested above and storing that info in the Redis locking db.

PePe Amengual

04/21/2023, 4:09 PM
@Pardeep Bhatt that is awesome to hear, do you think you can contribute that back to atlantis?

Jon

04/21/2023, 5:55 PM
fwiw, we've had zero issues running atlantis with the redis backend and shared efs storage, and we've been doing it since it was merged. I sorta attempted to show how to do this via this PR: https://github.com/runatlantis/atlantis/pull/2771