Slackbot
10/11/2022, 7:15 AM
Xipeng Guan
10/11/2022, 7:15 AM
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: yatai-deployment-default-domain-test
  namespace: yatai-deployment
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: yatai-default-domain-test
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SYSTEM_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: quay.io/bentoml/yatai-default-domain:0.0.2
        imagePullPolicy: IfNotPresent
        name: default-domain
        resources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Never
      schedulerName: default-scheduler
      serviceAccount: yatai-deployment
      serviceAccountName: yatai-deployment
EOF
Xipeng Guan
10/11/2022, 7:16 AM
kubectl -n yatai-deployment logs -f job/yatai-deployment-default-domain-test
Boris Bibic
10/11/2022, 12:52 PM
Boris Bibic
10/11/2022, 12:58 PM
internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com
looks like the private LB URL, just as I mentioned, but no ALB is getting created. It is hard for me to diagnose what is going on, since the logic is inside the application (the job) and not in the Helm chart.
Xipeng Guan
10/11/2022, 4:04 PM
export DOMAIN_SUFFIX="$(dig +short internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com | head -n 1).sslip.io"
echo $DOMAIN_SUFFIX
and set the domain suffix with the following command:
kubectl -n yatai-deployment patch cm/network --type merge --patch "{\"data\":{\"domain-suffix\":\"$DOMAIN_SUFFIX\"}}"
Boris Bibic
10/12/2022, 3:37 PM
export DOMAIN_SUFFIX="$(dig +short internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com | head -n 1).sslip.io"
echo $DOMAIN_SUFFIX
.sslip.io
I was expecting something more, but even this seems to fix the issue: the yatai-deployment pod is now running. Out of curiosity, do you have any ideas on why this would happen?
The docs state that I don't need to do anything in the case of sslip.io (and I haven't). So, something was not properly configuring the DNS. On the AWS side, our VPC has DNS hostnames and DNS resolution enabled, and no internal Hosted Zone configured.
We still need to test whether Yatai itself will work, and not just that the Kubernetes pod is running. We will post updates here.
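The empty suffix above (a bare ".sslip.io" when dig returns nothing) could be caught before patching. A minimal sketch, assuming a hypothetical `build_suffix` helper whose argument stands in for the live `dig +short ... | head -n 1` lookup:

```shell
#!/bin/sh
# Hypothetical guard: refuse to build a sslip.io suffix from an empty resolution.
# The argument stands in for: dig +short "$LB_HOST" | head -n 1
build_suffix() {
  resolved="$1"
  if [ -z "$resolved" ]; then
    echo "error: LB hostname did not resolve yet; not patching" >&2
    return 1
  fi
  echo "${resolved}.sslip.io"
}

build_suffix ""              # fails loudly instead of yielding ".sslip.io"
build_suffix "10.0.191.201"  # prints 10.0.191.201.sslip.io
```

A non-zero exit here would stop the pipeline before a broken domain suffix gets written into the ConfigMap.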
And the idea to add annotations like this:
--set layers.network.ingressAnnotation."alb.ingress.kubernetes.io/scheme"=internal
didn't even add any ingress annotation data to the network CM. I am testing a few ideas using the patch command, but that is more of a hotfix than a permanent solution.
Xipeng Guan
10/12/2022, 3:42 PM
Boris Bibic
10/13/2022, 7:37 AM
# dig +short internal-k8s-dev-test-d387934251-0653901712.us-east-1.elb.amazonaws.com | head -n 1
# 10.0.191.201
But I do not believe that the ALB is ever getting created in the case of Yatai, or that the internal (VPC) DNS is creating the record that the job logs are showing…
Xipeng Guan
10/13/2022, 7:40 AM
10.0.191.201.sslip.io
Boris Bibic
10/13/2022, 7:41 AM
internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com
Boris Bibic
10/13/2022, 7:42 AM
.sslip.io!
Xipeng Guan
10/13/2022, 7:45 AM
Xipeng Guan
10/13/2022, 7:46 AM
export DOMAIN_SUFFIX=10.0.191.201.sslip.io
Xipeng Guan
10/13/2022, 7:46 AM
kubectl -n yatai-deployment patch cm/network --type merge --patch "{\"data\":{\"domain-suffix\":\"$DOMAIN_SUFFIX\"}}"
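The escaped quotes in that patch string are easy to mangle in a shell. A sketch of the same idea (same cm/network ConfigMap), building the merge-patch JSON with `printf` so it can be inspected before applying:

```shell
#!/bin/sh
# Build the merge-patch JSON for the network ConfigMap, then inspect it.
DOMAIN_SUFFIX="10.0.191.201.sslip.io"
PATCH=$(printf '{"data":{"domain-suffix":"%s"}}' "$DOMAIN_SUFFIX")
echo "$PATCH"
# then apply it:
# kubectl -n yatai-deployment patch cm/network --type merge --patch "$PATCH"
```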
Boris Bibic
10/13/2022, 7:48 AM
10.0.191.201 is not Yatai. The URL Yatai printed does not resolve to any IP, not in the VPC and not in the pod (K8s network)!
Boris Bibic
10/13/2022, 7:50 AM
Xipeng Guan
10/13/2022, 7:51 AM
Boris Bibic
10/13/2022, 7:52 AM
Xipeng Guan
10/13/2022, 8:00 AM
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress
  annotations: ${your configured annotations}
spec:
  ingressClassName: ${your configured ingressClass}
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80
EOF
kubectl get ing test-ingress -o jsonpath='{.status.loadBalancer.ingress}'
So, you should figure out why your load balancer has assigned a URL that cannot be resolved.
Boris Bibic
10/13/2022, 9:45 AM
dig +short internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com | head -n 1
But (and this is very important!), it took 2 minutes and 30 seconds to provision the ALB, and in that time the dig was not returning the IP!
In those 2.5 minutes, it did show the URL (internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com) and the Ingress as up and working, but dig or nslookup would not return the IP.
Only after about 2.5 minutes (when AWS created the ALB and the status went from Provisioning to Active) did dig return the IP of 10.0.177.153.
So, it is very clear that one cannot just get the URL and assume it will have an IP address; the implementation of the Ingress matters! If you have a timeout of, say, 30 seconds, then you will fail to obtain the IP even though you have the correct URL.
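The waiting logic described above can be sketched as a retry loop. Here `fake_resolve` is a stand-in for `dig +short $LB_HOST | head -n 1` that returns nothing for the first two attempts, mimicking the ALB's Provisioning window; the names, counts, and address are illustrative:

```shell
#!/bin/sh
# Sketch: poll until the LB hostname resolves, instead of relying on a single dig.
fake_resolve() {
  # Stand-in for: dig +short "$LB_HOST" | head -n 1
  # Empty for the first two attempts (ALB still provisioning), then an address.
  [ "$1" -ge 3 ] && echo "10.0.177.153"
}

IP=""
attempt=1
while [ "$attempt" -le 10 ]; do
  IP=$(fake_resolve "$attempt")
  [ -n "$IP" ] && break
  attempt=$((attempt + 1))
  # in real use: sleep 30   # total budget must exceed ALB provisioning time
done
echo "resolved on attempt $attempt: ${IP:-none}"
```

The point is that the retry budget, not any single lookup, has to outlast the ~2.5-minute provisioning window.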
Can we change and extend the time for which dig is executed internally inside the job, or, even better, wait for the alb IngressClass to actually finish provisioning the ALB on AWS, so that the ALB is in the Active state?
Rustem Salawat
10/13/2022, 3:10 PM
2022-10-13T15:07:56+0000 [INFO] [runner:crypto_post:1] Service loaded from Bento directory: bentoml.Service(tag="crypto_post:237an7caxomv2csm", path="/home/bentoml/bento/")
2022-10-13T15:07:56+0000 [INFO] [runner:crypto_post:1] Jax version 0.3.20, Flax version 0.6.0 available.
2022-10-13T15:08:00+0000 [ERROR] [runner:crypto_post:1] Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 671, in lifespan
    async with self.lifespan_context(app):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 650, in startup
    handler()
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 230, in init_local
    self._init_local()
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 221, in _init_local
    self._init(LocalRunnerRef)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 215, in _init
    runner_handle = handle_class(self)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 25, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params) # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/frameworks/transformers.py", line 473, in __init__
Sean
10/14/2022, 6:08 AM
Sean
10/14/2022, 6:09 AM
Rustem Salawat
10/14/2022, 10:04 AM
Sean
10/14/2022, 10:24 PM
Boris Bibic
10/18/2022, 8:26 AM
Xipeng Guan
10/18/2022, 10:33 AM
Boris Bibic
10/18/2022, 11:56 AM
Xipeng Guan
10/18/2022, 1:33 PM
Rustem Salawat
10/21/2022, 1:53 PM
error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "xxxxx.dkr.ecr.us-east-1.amazonaws.com/dev-zenpulsar-pump-yatai-bentos:yatai.neural.1.2": Post "https://xxxxxx.dkr.ecr.us-east-1.amazonaws.com/v2/dev-zenpulsar-pump-yatai-bentos/blobs/uploads/": EOF
Rustem Salawat
10/26/2022, 10:19 PM
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: yatai-ecr-registry-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: yatai-ecr-registry-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["regcred"]
  verbs: ["delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["secrets", "pods"]
  verbs: ["delete", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: yatai-ecr-registry-rb
subjects:
- kind: ServiceAccount
  name: yatai-ecr-registry-sa
  apiGroup: ""
roleRef:
  kind: Role
  name: yatai-ecr-registry-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yatai-ecr-registry-cm
data:
  AWS_REGION: "us-east-1"
  DOCKER_SECRET_NAME: yatai-docker-registry-credentials
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: yatai-ecr-registry-cron
spec:
  schedule: "0 */10 * * *"
  successfulJobsHistoryLimit: 3
  suspend: false
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: yatai-ecr-registry-sa
          containers:
          - name: ecr-registry-helper
            image: odaniait/aws-kubectl:latest
            imagePullPolicy: IfNotPresent
            envFrom:
            - configMapRef:
                name: yatai-ecr-registry-cm
            command:
            - /bin/sh
            - -c
            - |-
              ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
              NAMESPACE_NAME=yatai-deployment
              kubectl delete secret --ignore-not-found $DOCKER_SECRET_NAME -n $NAMESPACE_NAME
              kubectl create secret generic $DOCKER_SECRET_NAME --from-literal=password=$ECR_TOKEN \
                --namespace=$NAMESPACE_NAME
              kubectl delete pods -l app.kubernetes.io/name=yatai-deployment
              echo "Secret was successfully updated at $(date)"
          restartPolicy: Never
Xipeng Guan
10/27/2022, 8:38 AM
Rustem Salawat
10/27/2022, 8:41 AM
Boris Bibic
10/27/2022, 9:34 AM
yatai-builders, and one in the yatai Namespace. The first one updates the docker-config Secret like this:
- /bin/sh
- -c
- |-
  ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
  kubectl delete secret --ignore-not-found $REGCRED_SECRET_NAME
  K8S_SECRET=`kubectl create secret docker-registry $REGCRED_SECRET_NAME \
    --docker-server=$AWS_ECR --docker-password=$ECR_TOKEN --docker-username=AWS \
    --dry-run=client -o jsonpath='{.data.*}' | base64 -d`
  kubectl create secret generic $REGCRED_SECRET_NAME --from-literal=config.json="$K8S_SECRET"
  echo "Secret was successfully updated at $(date)"
The second Job updates the yatai-regcred Secret with the below command inside the Job:
- /bin/sh
- -c
- |-
  ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
  kubectl delete secret --ignore-not-found $REGCRED_SECRET_NAME
  kubectl create secret docker-registry $REGCRED_SECRET_NAME --docker-server=$AWS_ECR \
    --docker-password=$ECR_TOKEN --docker-username=AWS
  echo "Secret was successfully updated at $(date)"
Also, there are 3 variables added to the ConfigMap: AWS_REGION, REGCRED_SECRET_NAME, and AWS_ECR. The rest of the code is the same as the one posted by @Rustem Salawat.
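For reference, the amended ConfigMap data described above would look roughly like this; the values are placeholders (the AWS_ECR value follows the usual <account-id>.dkr.ecr.<region>.amazonaws.com registry hostname format, and the account ID here is invented):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: yatai-ecr-registry-cm
data:
  AWS_REGION: "us-east-1"
  REGCRED_SECRET_NAME: "yatai-regcred"   # docker-config in the yatai-builders Job
  AWS_ECR: "123456789012.dkr.ecr.us-east-1.amazonaws.com"   # placeholder account ID
```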
The reason for the 2 Jobs is simple to explain: with a single Job, it would need permissions to access Secrets in different Namespaces, which adds complexity, and we were in a hurry... The Secrets have different structures, so we have 2 methods for replacing them. I've set the Jobs' schedule to 10 hours, just to be safe. Luckily, we didn't need to restart any pods.
Xipeng Guan
10/27/2022, 9:37 AM