# ask-for-help
x
```bash
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: yatai-deployment-default-domain-test
  namespace: yatai-deployment
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        app: yatai-default-domain-test
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SYSTEM_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: quay.io/bentoml/yatai-default-domain:0.0.2
        imagePullPolicy: IfNotPresent
        name: default-domain
        resources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Never
      schedulerName: default-scheduler
      serviceAccount: yatai-deployment
      serviceAccountName: yatai-deployment
EOF
```
```bash
kubectl -n yatai-deployment logs -f job/yatai-deployment-default-domain-test
```
b
```
kubectl -n yatai-deployment logs -f job/yatai-deployment-default-domain-test
Found 7 pods, using pod/yatai-deployment-default-domain-test-xmdr5
time="2022-10-11T11:00:37Z" level=info msg="Creating ingress default-domain- to get a ingress IP automatically"
time="2022-10-11T11:00:37Z" level=info msg="Waiting for ingress default-domain-chplz to be ready"
time="2022-10-11T11:01:07Z" level=info msg="Ingress default-domain-chplz is ready"
panic: Error getting domain suffix: failed to resolve ip address for hostname internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com: lookup internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com on 172.20.0.10:53: no such host

goroutine 1 [running]:
main.main()
        /workspace/pkg/main.go:22 +0xa5
```
@Xipeng Guan Is there anything else you want me to try? The `internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com` hostname looks like the private LB URL, just like I mentioned, but no ALB is getting created. It is hard for me to diagnose what is going on, as the logic is inside the application (job) and not in the Helm chart.
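A quick diagnostic sketch for the "no ALB is getting created" suspicion; the namespace is the one used above, and a configured AWS CLI for this account is an assumption:

```bash
# List the DNS names of all ALBs/NLBs in the account/region; if the hostname
# from the job log is absent, the controller never created the load balancer.
aws elbv2 describe-load-balancers --query 'LoadBalancers[].DNSName' --output text

# Inspect ingresses in the yatai-deployment namespace for controller events
# (the job's temporary ingress may already be gone by the time you look).
kubectl -n yatai-deployment describe ingress
```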
x
@Boris Bibic Oh, I know what you mean and what happened. Can you manually generate the domain suffix with the following command in your VPN environment?
```bash
export DOMAIN_SUFFIX="$(dig +short internal-k8s-yataidep-defaultd-a493d1ab8b-1108353450.us-east-1.elb.amazonaws.com | head -n 1).sslip.io"
echo $DOMAIN_SUFFIX
```
and set the domain suffix with the following command:
```bash
kubectl -n yatai-deployment patch cm/network --type merge --patch "{\"data\":{\"domain-suffix\":\"$DOMAIN_SUFFIX\"}}"
```
b
@Xipeng Guan The value of DOMAIN_SUFFIX that I get is this:
```bash
export DOMAIN_SUFFIX="$(dig +short internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com | head -n 1).sslip.io"

echo $DOMAIN_SUFFIX
.sslip.io
```
I was expecting something more, but even this seems to fix the issue: the `yatai-deployment` pod is now running. Out of curiosity, do you have any ideas on why this would happen? The docs state that I don't need to do anything in the case of sslip.io (and I haven't), so something was not properly configured in the DNS. On the AWS side, our VPC has `DNS hostnames` and `DNS resolution` enabled, and no internal Hosted Zone configured. We still need to test whether Yatai will work, and not just have a working Kubernetes pod; we will post updates here. Also, the idea to add annotations like `--set layers.network.ingressAnnotation."alb.ingress.kubernetes.io/scheme"=internal` hasn't added any ingress annotation data to the `network` CM. I am testing a few ideas using the patch command, but that is more of a hotfix than a permanent solution.
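To check what the Helm `--set` actually produced, one can dump the `network` ConfigMap directly; a minimal sketch, using the ConfigMap name and namespace already seen in this thread:

```bash
# Show the data of the network ConfigMap; if the ingress annotation made it
# in, it should appear here alongside domain-suffix.
kubectl -n yatai-deployment get cm network -o yaml
```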
x
@Boris Bibic Did you run the dig command in your VPN environment? If so, that means your VPN environment can't resolve the private domain name either.
b
@Xipeng Guan Yes, I did run it with VPN enabled. I can tell you it did resolve the existing ALB:
```bash
# dig +short internal-k8s-dev-test-d387934251-0653901712.us-east-1.elb.amazonaws.com | head -n 1
# 10.0.191.201
```
But I do not believe that the ALB is ever getting created in the case of Yatai, and the internal (VPC) DNS never gets the record that the job logs are showing…
x
@Boris Bibic Sorry, I don't get you, but since you've got this IP, you can manually set the domain-suffix to `10.0.191.201.sslip.io`
b
I haven’t got the IP of the LB that the Job is printing for me; there is no IP for `internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com`. This was the IP of my other LB that has nothing to do with Yatai, so I can’t add anything to the domain-suffix apart from `.sslip.io`!
x
Because the network in the container is not in the VPN, the domain cannot be resolved in the container, but that does not matter. You just need to be able to resolve the domain in your VPN environment, and set the resolved IP, spliced into an sslip.io domain, as the domain-suffix, because the domain that the domain-suffix is spliced into is for users in the VPN environment:
```bash
export DOMAIN_SUFFIX=10.0.191.201.sslip.io
```
```bash
kubectl -n yatai-deployment patch cm/network --type merge --patch "{\"data\":{\"domain-suffix\":\"$DOMAIN_SUFFIX\"}}"
```
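To see the split Xipeng describes, one can compare a lookup from inside the cluster with one from a VPN-connected workstation; a sketch, with the hostname taken from the messages above and busybox as an arbitrary test image:

```bash
# From inside the cluster (pods are not on the VPN): expect this lookup
# of the private ALB hostname to fail.
kubectl run dns-check --rm -it --image=busybox --restart=Never -- \
  nslookup internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com

# From a workstation on the VPN: expect the private IP.
dig +short internal-k8s-yataidep-defaultd-50acf98fdd-488316423.us-east-1.elb.amazonaws.com
```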
b
@Xipeng Guan What did you not understand in “I haven’t got the IP of the LB that the Job is printing me”? This IP, `10.0.191.201`, is not Yatai's. The URL Yatai printed does not resolve to any IP, not in the VPC, not in the pod (K8s network)!
You want me to point the Yatai domain-suffix at some random LB that has zero connection to the EKS cluster, let alone to the EKS cluster that Yatai is running in? Please understand that the Job is generating a rubbish URL that does not exist in the entire VPC. And tell me if we can somehow fix the Job, so that the URL is not randomly generated rubbish that merely looks like the URL of an LB, but actually becomes a valid record in the DNS of the VPC.
x
Sorry, I mistakenly thought that the domain you gave here was the same as the previous one
b
@Xipeng Guan No problem, I should have been more precise there. It is important that you understand the issue: there is no IP for the URL from the Yatai Job, the VPC DNS is working as expected, and this Yatai URL is never getting added to the DNS.
x
@Boris Bibic Hi, let me explain how yatai prints out this URL. Yatai creates a temporary ingress resource according to the ingressClass and ingressAnnotation you configured, waits for the ingress resource to be assigned the address field, and then gets the address field; exactly the same behavior as using the following commands:
```bash
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress
  annotations: ${your configured annotations}
spec:
  ingressClassName: ${your configured ingressClass}
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80
EOF
```
```bash
kubectl get ing test-ingress -o jsonpath='{.status.loadBalancer.ingress}'
```
So, you should figure out why your load balancer was assigned a URL that cannot be resolved.
b
@Xipeng Guan I think I figured it out! The `test-ingress` Ingress that you provided is working, and I can get the IP:
```bash
dig +short internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com | head -n 1
```
But (and this is very important!), it took 2 minutes and 30 seconds to provision the ALB, and in that time dig was not returning the IP! In those 2.5 minutes, it did show the URL (internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com) and the Ingress as up and working, but dig or nslookup would not return the IP. Only after about 2.5 minutes (when AWS finished creating the ALB and its status went from Provisioning to Active) did dig return the IP, 10.0.177.153. So, it is very clear that one cannot just get the URL and assume that it will have an IP address; the implementation of the Ingress matters! If you have a timeout of, let's say, 30 seconds, then you will fail to obtain the IP even though you have the correct URL. Can we change and extend the time that you are executing dig internally inside the job, or even better, wait for the alb IngressClass to actually finish provisioning the ALB on AWS, so that the ALB is in the Active state?
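As an illustration of the wait Boris is asking for, a minimal retry sketch that polls DNS until the ALB hostname resolves instead of failing on the first lookup; the hostname, timeout, and interval are all assumptions:

```bash
HOSTNAME=internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com
TIMEOUT=300   # seconds; ALB provisioning took ~150s in the test above
INTERVAL=10
elapsed=0
while [ "$elapsed" -lt "$TIMEOUT" ]; do
  # dig returns nothing while the ALB is still provisioning.
  IP="$(dig +short "$HOSTNAME" | head -n 1)"
  if [ -n "$IP" ]; then
    echo "Resolved after ${elapsed}s: $IP"
    export DOMAIN_SUFFIX="${IP}.sslip.io"
    break
  fi
  sleep "$INTERVAL"
  elapsed=$((elapsed + INTERVAL))
done
```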
r
@Xipeng Guan Hi, please help. When we deploy runners (in kube), we get this in their logs. The service works; the runner does not.
```
2022-10-13T15:07:56+0000 [INFO] [runner:crypto_post:1] Service loaded from Bento directory: bentoml.Service(tag="crypto_post:237an7caxomv2csm", path="/home/bentoml/bento/")
2022-10-13T15:07:56+0000 [INFO] [runner:crypto_post:1] Jax version 0.3.20, Flax version 0.6.0 available.
2022-10-13T15:08:00+0000 [ERROR] [runner:crypto_post:1] Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 671, in lifespan
    async with self.lifespan_context(app):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 650, in startup
    handler()
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 230, in init_local
    self._init_local()
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 221, in _init_local
    self._init(LocalRunnerRef)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 215, in _init
    runner_handle = handle_class(self)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 25, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/frameworks/transformers.py", line 473, in __init__
s
I can't help but notice some not very friendly comments in this discussion thread. I understand that communication over text can be hard, and sometimes we get frustrated if the messages do not get through. I kindly ask everyone to please be more patient with our exchanges and keep this community friendly and engaging.
@Rustem Salawat I don’t think the traceback you posted is complete. It is cut off at the end.
r
@Sean I’m sorry if I offended you with something; we respect and are very grateful for your work, and we didn't mean to offend you. Perhaps it was my level of communication (I'm not a native English speaker) that seemed rude. I will try to be more careful. Thanks again.
s
@Rustem Salawat Thanks for your understanding. I was particularly referring to an exchange during the ingress setup discussion. Again, I understand communication over text can be hard. Just a reminder for us to be mindful of our words and expressions. Ultimately, the community is here to support each other. 🙂
b
Hi @Sean, @Xipeng Guan, how are you? Is there anything you could do to help @Rustem Salawat and me deploy the Yatai Ingress? As mentioned, the Job triggers the creation of an Ingress defined by the IngressClass and tries to get the Ingress IP (the AWS ALB's IP, to be more precise), but it needs to wait at least 2.5 minutes before it runs the dig command to get the IP. Is this possible to solve? There is nothing I can do on my side to make AWS return the IP of the ALB that is being provisioned, nor can I speed up the provisioning process to get the IP of the ALB quicker. So unfortunately, as I see it, there is nothing I can do to make AWS comply. The only option left is to ask for your help in inserting a timeout/waiting period between the initial command that creates the Ingress (ALB) and the dig command (getting the IP of the Ingress ALB). Thanks in advance for everything you have done for us so far! 🙂
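For reference, one way to watch the provisioning state from outside the job is to ask AWS directly; a sketch, assuming a configured AWS CLI and using the hostname from the earlier test:

```bash
# Prints "provisioning" until AWS activates the ALB, then "active";
# dig only starts returning an IP once the state is "active".
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[?DNSName==`internal-k8s-yataidep-testingr-ecb0a5fe39-56876996.us-east-1.elb.amazonaws.com`].State.Code' \
  --output text
```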
x
b
Hi @Xipeng Guan, I can try using a private Route 53 Hosted Zone, but I was trying to avoid it. Well, if it is the best solution at this moment, then I will try it out.
x
If you don't want to use real DNS, use the steps I described before to manually simulate what the default-domain job does to generate a domain suffix, and set it in the network configmap.
r
Hi guys, @Xipeng Guan @Sean @Boris Bibic, please give me a hint. We are trying to connect an external Docker repository (AWS ECR), and at the build stage the error below occurs. I guess it's in the Kaniko settings; what could it be? Should I try to update the "insecure" options? Or should I update the Docker config? (When I update the token in the config, it gets changed every time I build; I don't understand where it gets it from.) I also found this instruction: https://github.com/GoogleContainerTools/kaniko#pushing-to-amazon-ecr
```
error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "xxxxx.dkr.ecr.us-east-1.amazonaws.com/dev-zenpulsar-pump-yatai-bentos:yatai.neural.1.2": Post "https://xxxxxx.dkr.ecr.us-east-1.amazonaws.com/v2/dev-zenpulsar-pump-yatai-bentos/blobs/uploads/": EOF
```
👀 1
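For context, the kaniko instruction Rustem links configures ECR auth through a credential helper in the Docker config; a minimal sketch of that approach (the registry host is the redacted one from the error above):

```bash
# Write a config.json that tells kaniko to use the ECR credential helper
# for this registry (see the linked kaniko README for the full setup,
# including mounting this file into the kaniko container).
cat <<'EOF' > config.json
{
  "credHelpers": {
    "xxxxx.dkr.ecr.us-east-1.amazonaws.com": "ecr-login"
  }
}
EOF
```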
Hi guys, @Xipeng Guan @Sean @Boris Bibic, we solved the problem! Kaniko required AWS authorization, and we found secrets that need to be updated constantly. We made a CronJob that periodically updates the tokens and refreshes them in the secrets, in 3 different places. And the Kaniko we had on the server didn't work until we updated BentoML to the branch version. I found more errors when building the image and fixed them as best I could. I made a PR: https://github.com/bentoml/BentoML/pull/3148
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: yatai-ecr-registry-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: yatai-ecr-registry-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["regcred"]
  verbs: ["delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["secrets","pods"]
  verbs: ["delete", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: yatai-ecr-registry-rb
subjects:
- kind: ServiceAccount
  name: yatai-ecr-registry-sa
  apiGroup: ""
roleRef:
  kind: Role
  name: yatai-ecr-registry-role
  apiGroup: ""
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yatai-ecr-registry-cm
data:
  AWS_REGION: "us-east-1"
  DOCKER_SECRET_NAME: yatai-docker-registry-credentials
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: yatai-ecr-registry-cron
spec:
  schedule: "0 */10 * * *"
  successfulJobsHistoryLimit: 3
  suspend: false
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: yatai-ecr-registry-sa
          containers:
          - name: ecr-registry-helper
            image: odaniait/aws-kubectl:latest
            imagePullPolicy: IfNotPresent
            envFrom:
              - configMapRef:
                  name: yatai-ecr-registry-cm
            command:
              - /bin/sh
              - -c
              - |-
                ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
                NAMESPACE_NAME=yatai-deployment
                kubectl delete secret --ignore-not-found $DOCKER_SECRET_NAME -n $NAMESPACE_NAME
                kubectl create secret generic $DOCKER_SECRET_NAME --from-literal=password=$ECR_TOKEN \
                --namespace=$NAMESPACE_NAME
                kubectl delete pods -l app.kubernetes.io/name=yatai-deployment
                echo "Secret was successfully updated at $(date)"
          restartPolicy: Never
```
👍 1
x
@Rustem Salawat Great work! Very happy that you found a solution; this solution of yours is very valuable, and we plan to put it in the documentation later. Thanks for your contribution!
👍 2
r
@Xipeng Guan Thank you very much for your help. Boris, @Boris Bibic, please clarify and write up what to do with the tokens. Am I the one who should post the code to run the jobs?
b
Hi @Xipeng Guan, just wanted to add more context and clarify the Jobs that we created to replace the old AWS ECR tokens. There are 2 Jobs running: one in the `yatai-builders` Namespace, and one in the `yatai` Namespace. The first one updates the `docker-config` Secret like this:
```yaml
- /bin/sh
- -c
- |-
  ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
  kubectl delete secret --ignore-not-found $REGCRED_SECRET_NAME
  K8S_SECRET=`kubectl create secret docker-registry $REGCRED_SECRET_NAME \
  --docker-server=$AWS_ECR --docker-password=$ECR_TOKEN --docker-username=AWS \
  --dry-run=client -o jsonpath='{.data.*}' | base64 -d`
  kubectl create secret generic $REGCRED_SECRET_NAME --from-literal=config.json=$K8S_SECRET
  echo "Secret was successfully updated at $(date)"
The second Job updates the `yatai-regcred` Secret with the command below inside the Job:
```yaml
- /bin/sh
- -c
- |-
  ECR_TOKEN=`aws ecr get-login-password --region ${AWS_REGION}`
  kubectl delete secret --ignore-not-found $REGCRED_SECRET_NAME                
  kubectl create secret docker-registry $REGCRED_SECRET_NAME --docker-server=$AWS_ECR \
  --docker-password=$ECR_TOKEN --docker-username=AWS
  echo "Secret was successfully updated at $(date)"
Also, there are 3 variables added to the ConfigMap: `AWS_REGION`, `REGCRED_SECRET_NAME`, and `AWS_ECR`. The rest of the code is the same as the one posted by @Rustem Salawat. The reason for 2 Jobs is simple to explain: with one Job, the Job would need permissions to access Secrets in different Namespaces, and that adds complexity, and we were in a hurry... The Secrets have different structures, so we have 2 methods for replacing them. I've set the Jobs' schedule to 10 hours, just to be safe. Luckily, we didn't need to restart any pods.
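A quick way to verify that the refreshed secrets look right after a run; a sketch using the names and namespaces described above (jsonpath keys containing dots need escaping):

```bash
# The docker-config secret should contain a decodable config.json...
kubectl -n yatai-builders get secret docker-config \
  -o jsonpath='{.data.config\.json}' | base64 -d

# ...and yatai-regcred should be a standard dockerconfigjson secret.
kubectl -n yatai get secret yatai-regcred \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```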
x
@Boris Bibic Great Job! Thank you!
👍 1