j

Justin S

04/17/2023, 3:20 PM
Anyone able to assist with Atlantis breaking in our prod environment? All plans started saying `/usr/bin/git exited with 128: fatal: detected dubious ownership in`. Restarting the container now fails with:
Error: initializing server: writing generated .git-credentials file with user, token and hostname to /nonexistent/.git-credentials: open /nonexistent/.git-credentials: no such file or directory
p

PePe Amengual

04/17/2023, 3:27 PM
Did you upgrade the git client version?
j

Justin S

04/17/2023, 3:27 PM
No.
p

PePe Amengual

04/17/2023, 3:27 PM
look at the issue
j

Justin S

04/17/2023, 3:27 PM
This is all in Kubernetes, no change in image, or helm chart
p

PePe Amengual

04/17/2023, 3:28 PM
this has been documented
j

Justin S

04/17/2023, 3:28 PM
I have seen the issue.
What was posted there doesn't appear to match up.
p

PePe Amengual

04/17/2023, 3:29 PM
what about the permissions of the Atlantis data dir?
j

Justin S

04/17/2023, 3:29 PM
They wouldn't randomly change.
p

PePe Amengual

04/17/2023, 3:29 PM
could that have changed ?
j

Justin S

04/17/2023, 3:29 PM
statefulSet:
  securityContext:
    fsGroup: 1000
    runAsUser: 100
    fsGroupChangePolicy: "OnRootMismatch"
No change.
p

PePe Amengual

04/17/2023, 3:31 PM
Well, something changed for sure, and the Atlantis code does not do that, so if for whatever reason the filesystem permissions changed you could possibly get this error.
j

Justin S

04/17/2023, 3:31 PM
I cannot imagine a situation where, out of 100+ PVCs, this one randomly changed.
p

PePe Amengual

04/17/2023, 3:31 PM
Can you run something like a chown on the pod?
j

Justin S

04/17/2023, 3:31 PM
I can't launch the pod
because it fails with the /nonexistent error
❯ kubectl logs pod/atlantis-0 -n atlantis
{"level":"info","ts":"2023-04-17T15:32:41.901Z","caller":"vcs/gitlab_client.go:110","msg":"determined GitLab is running version 15.10.0","json":{}}
Error: initializing server: writing generated .git-credentials file with user, token and hostname to /nonexistent/.git-credentials: open /nonexistent/.git-credentials: no such file or directory
just spun up a test ENV
same error
def. tied just to atlantis
did a re-deploy in the test ENV of everything using EFS
only atlantis breaks
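The /nonexistent path in the error suggests that the home directory resolved for the container's user is /nonexistent. A minimal Python sketch for checking which home directory a given UID maps to inside an image follows; `home_for_uid` is a hypothetical helper name, and nothing here is taken from the Atlantis code.

```python
import os
import pwd


def home_for_uid(uid):
    """Return the passwd home directory for uid, or None if there is no entry."""
    try:
        return pwd.getpwuid(uid).pw_dir
    except KeyError:
        # No passwd entry for this UID (common when runAsUser picks an
        # arbitrary number that the image never created).
        return None


if __name__ == "__main__":
    uid = os.getuid()
    print(f"uid={uid} home={home_for_uid(uid)}")
```

Running this inside the container as the configured runAsUser should show whether that UID has a passwd entry and where its home points.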
p

PePe Amengual

04/17/2023, 3:48 PM
efs is nfs basically
j

Justin S

04/17/2023, 3:49 PM
Yes.
p

PePe Amengual

04/17/2023, 3:49 PM
Could you try an EBS volume for Atlantis?
get it out of efs?
j

Justin S

04/17/2023, 3:49 PM
One sec, I'm spinning up a 3rd test ENV
p

PePe Amengual

04/17/2023, 3:49 PM
in your test env
j

Justin S

04/17/2023, 3:49 PM
with EFS and no pod security templates,
since a fresh install with the defaults fails.
One thing that's interesting is that the GitHub issue and the helm chart both assume the atlantis user is 100:1000, but it's 1000:1000 on the Debian image.
Probably a good reason to specify the UID here:
# Add atlantis user to Debian as well
RUN useradd --create-home --user-group --shell /bin/bash atlantis && \
    adduser atlantis root && \
    chown atlantis:root /home/atlantis/ && \
    chmod g=u /home/atlantis/ && \
    chmod g=u /etc/passwd
OK, it's back to the original error:
dubious ownership
p

PePe Amengual

04/17/2023, 3:59 PM
in a non efs volume ?
j

Justin S

04/17/2023, 4:00 PM
still EFS
I have to set up an EBS provisioner, since we don't use it anywhere,
but I'm at least back to Atlantis being broken.
drwxrwxr-x   5 1001 1001 6.0K Apr 17 15:55  atlantis-data
That's interesting, since that uid/gid does not exist.
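Git reports "dubious ownership" when a repository directory is owned by a different user than the one running git. A minimal Python sketch of that comparison for a data directory follows; `owner_mismatch` is a hypothetical helper, and the /atlantis-data path in the usage line is taken from the logs in this thread.

```python
import os


def owner_mismatch(path):
    """Compare the owner of path with the effective UID of this process.

    Returns (mismatch, owner_uid, owner_gid, process_uid).
    """
    st = os.stat(path)
    process_uid = os.geteuid()
    return st.st_uid != process_uid, st.st_uid, st.st_gid, process_uid


if __name__ == "__main__":
    # Assumed mount point; adjust to wherever the volume is mounted.
    print(owner_mismatch("/atlantis-data"))
```

A mismatch of True, with an owner like 1001 against a process UID of 100 or 1000, would line up with the listing above.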
p

PePe Amengual

04/17/2023, 4:01 PM
ohhhhh
j

Justin S

04/17/2023, 4:05 PM
I have no idea where that comes from TBH
and latest debian image is completely broken
Digest: sha256:5389ae79b49230b8e4c6a305230f69e43f65c0cd4312f3822684880e39fdb47c
Status: Downloaded newer image for ghcr.io/runatlantis/atlantis:v0.23.4-debian
error: failed switching to "atlantis": unable to find user atlantis: no matching entries in passwd file
p

PePe Amengual

04/17/2023, 4:15 PM
I think that @Dylan Page is fixing or fixed that
We had a few issues with the Debian image.
j

Justin S

04/17/2023, 4:16 PM
I don't understand why this is mounting the volume as 1001.
I would expect it to mount it as root, honestly.
d

Dylan Page

04/17/2023, 4:16 PM
It’s fixed in the latest dev image
j

Justin S

04/17/2023, 4:17 PM
I'm still on the 23.3 image
which worked up until.. today
p

PePe Amengual

04/17/2023, 4:27 PM
The thing is that the permissions are set at the Dockerfile level, and after that Atlantis runs.
The code just uses the filesystem, so there is no way for Atlantis to change that.
j

Justin S

04/17/2023, 4:28 PM
we deploy via atlantis
i have no MR showing a change
we didn't change
we did a fresh deploy, same error
and the permissions are not done at the dockerfile level.
p

PePe Amengual

04/17/2023, 4:28 PM
did you try the non EFS install?
j

Justin S

04/17/2023, 4:28 PM
we only use EFS
in 20 clusters
and only atlantis is having this error
Getting EBS provisioners in place will take a while
to troubleshoot a broken app
p

PePe Amengual

04/17/2023, 4:31 PM
Any information you have, add it to the issue you commented on, since this chat will disappear in a month.
j

Justin S

04/17/2023, 4:31 PM
I'll try to get it in place.
Currently I have to back Atlantis out and move things back into Flux for the time being
so we can operate.
p

PePe Amengual

04/17/2023, 4:31 PM
I do not run atlantis in K8s so I can’t help you much
j

Justin S

04/17/2023, 5:07 PM
So I found another instance of Atlantis we have,
deployed with the same code and the same base image; the only difference is that in this image we add the vault CLI.
It's working.
I can tell you that EFS creates a random UID/GID for the mount point, which is where the 10001 or 10002 etc. stuff comes from.
I'm not sure why that would randomly be a problem now.
So a custom EFS storage class is created, setting the uid/gid instead of using the range like it defaults to.
drwxrwxr-x   4 atlantis atlantis 6.0K Apr 17 17:07  atlantis-data
So that looks more correct.
p

PePe Amengual

04/17/2023, 5:12 PM
Can you force the UID on the chart? I believe it is possible.
j

Justin S

04/17/2023, 5:12 PM
We have been doing that.
Now I've forced EFS to match.
Testing now.
That fixed it.
I'm not sure why the OTHER env is working, because we are not specifying a UID/GID on EFS there, just on the chart.
p

PePe Amengual

04/17/2023, 5:14 PM
I worked with NFS for like 10 years, and I always had to force the UID and user.
That is why I wanted you to test EBS, just to make sure.
j

Justin S

04/17/2023, 5:15 PM
this is the only chart, where we force it
all using EFS
i think we have 20 clusters at this point.
The other atlantis deployment is working, without it being specified as well, which has me more confused
now it looks like atlantis has decided to use its own TF version
{
  "level": "info",
  "ts": "2023-04-17T17:10:47.192Z",
  "caller": "terraform/terraform_client.go:361",
  "msg": "Detected module requires version: 1.4.5",
  "json": {
    "repo": "sphinx/terraform-aws-infra",
    "pull": "75"
  }
}
ATLANTIS_DEFAULT_TF_VERSION=v1.3.7
DEFAULT_TERRAFORM_VERSION=1.4.2
not even sure where that second ENV var comes from now..
p

PePe Amengual

04/17/2023, 5:17 PM
In your workflow, do you have required_version? That version will be used.
j

Justin S

04/17/2023, 5:18 PM
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
  required_version = "~> 1.3"
}
That's what is defined everywhere.
environment:
  ATLANTIS_DEFAULT_TF_VERSION: v1.3.7
Yeah, I don't actually see DEFAULT_TERRAFORM_VERSION even defined in the chart.
j

Justin S

04/17/2023, 5:25 PM
I'm not sure what I'm supposed to be seeing?
required_version = "~> 1.3"
should not resolve to 1.4.5
Yeah, this doesn't make sense.
NOTE
The Atlantis latest docker image tends to have recent versions of Terraform, but there may be a delay as new versions are released. The highest version of Terraform allowed in your code is the version specified by DEFAULT_TERRAFORM_VERSION in the image your server is running.
DEFAULT_TERRAFORM_VERSION=1.4.2
and yet it pulls 1.4.5 when we say required_version = "~> 1.3"
p

PePe Amengual

04/17/2023, 5:28 PM
Atlantis can download TF for you, no matter what the default TF version is.
j

Justin S

04/17/2023, 5:29 PM
The docs imply it wouldn't use 1.4.5,
since the image states
DEFAULT_TERRAFORM_VERSION=1.4.2
Either way though, we specify 1.3.
{
  "level": "info",
  "ts": "2023-04-17T17:13:23.792Z",
  "caller": "models/shell_command_runner.go:156",
  "msg": "successfully ran \"/atlantis-data/bin/terraform1.4.5 plan -input=false -refresh -out \\\"/atlantis-data/repos/sphinx/terraform-aws-infra/75/default/us-gov-west-1/qa/network/default.tfplan\\\"\" in \"/atlantis-data/repos/sphinx/terraform-aws-infra/75/default/us-gov-west-1/qa/network\"",
  "json": {
    "repo": "sphinx/terraform-aws-infra",
    "pull": "75"
  }
}
❯ cat us-gov-west-1/qa/network/provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
  required_version = "~> 1.3"
}
The patch is relevant in ~> 1.3.0 to ensure it matches on only 1.3.x, whereas ~> 1.3 may match to 1.4
Why would it even match that?
Now I'm curious if it's been using 1.4.5 this entire time.
p

PePe Amengual

04/17/2023, 5:35 PM
you could check the state file
j

Justin S

04/17/2023, 5:35 PM
doing so now
def. feels wrong that 1.3 could match 1.4
"terraform_version": "1.3.9",
OK, im good now
TY
p

PePe Amengual

04/17/2023, 5:42 PM
You fixed it yourself!!!!
Thanks to you.
Were you using 1.4.5?
d

Dylan Page

04/17/2023, 6:21 PM
~= or ~> is essentially a wildcard on the lowest tier version.
It’s common in most package managers.
With ~> 1.3 you’re basically saying “any version no less than 1.3, and less than 2.0”.
If you do ~> 1.3.0, it changes to “any version no less than 1.3.0, and less than 1.4.0”.
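The rule described above can be sketched in a few lines of Python. This is only an illustration of the "rightmost component may increase" rule, not Terraform's actual implementation; `pessimistic_bounds` and `satisfies` are hypothetical helper names, and pre-release suffixes are not handled.

```python
def pessimistic_bounds(constraint):
    """Return (lower_inclusive, upper_exclusive) for a '~> X.Y' or '~> X.Y.Z' constraint."""
    parts = [int(p) for p in constraint.replace("~>", "").strip().split(".")]
    if len(parts) < 2:
        raise ValueError("this sketch expects at least major.minor")
    # Only the rightmost component may increase, so the upper bound bumps
    # the second-to-last component and drops the last one.
    upper = parts[:-2] + [parts[-2] + 1]
    return tuple(parts), tuple(upper)


def satisfies(version, constraint):
    """Check a plain numeric version such as '1.4.5' against a '~>' constraint."""
    v = [int(p) for p in version.split(".")]
    v += [0] * (3 - len(v))  # treat '1.3' as '1.3.0'
    lower, upper = pessimistic_bounds(constraint)
    return lower <= tuple(v) < upper


if __name__ == "__main__":
    for constraint in ("~> 1.3", "~> 1.3.0"):
        for version in ("1.3.9", "1.4.5", "2.0.0"):
            print(constraint, version, satisfies(version, constraint))
```

Under this sketch, 1.4.5 satisfies ~> 1.3 but not ~> 1.3.0, which is why pinning the patch component keeps Atlantis on the 1.3.x series.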