# ask-for-help
Xipeng Guan
Now we can manually create a `BentoRequest` CR to trigger the image builder. Yes, now we can set the `spec.imageBuilderExtraPodSpec.affinity` of the `BentoRequest` CR to specify the node affinity for the image builder pod.
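(For illustration, a minimal `BentoRequest` with a node-affinity constraint might look like the sketch below; the `resources.yatai.ai/v1alpha1` API version, the names, and the bento tag are assumptions to verify against your yatai-image-builder version:)

```yaml
apiVersion: resources.yatai.ai/v1alpha1
kind: BentoRequest
metadata:
  name: sentiment-bento                  # hypothetical name
  namespace: yatai
spec:
  bentoTag: sentiment_service:latest     # hypothetical bento tag pushed to Yatai
  imageBuilderExtraPodSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/arch    # pin the builder pod via a node label
              operator: In
              values:
              - amd64
```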
Ghawady Ehmaid
I see, so now the steps to deploy with Yatai would be:
1. `bentoml build`
2. `bentoml push $bentoname`
3. create a `BentoRequest` CR and run `kubectl apply`
4. create a `BentoDeployment` CR and run `kubectl apply`
Is that correct?
Xipeng Guan
Yes, those are the right steps.
πŸ‘ 1
Ghawady Ehmaid
Hi @Xipeng Guan, if possible I need your help to investigate why the GPU is not being used during inference. I can see that the container is starting on the required node, with the required GPU device attached and accessible within the container. But from the logs, the memory allocation, and the time inference takes, it is clear that the GPU is not utilized during inference. Not sure where I am getting it wrong.
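(A quick way to verify GPU visibility from inside the container is a sketch like this; the pipeline handle in the last line is hypothetical:)

```python
import torch

# Does PyTorch see the GPU inside the container at all?
print(torch.cuda.is_available())    # expect True
print(torch.cuda.device_count())    # expect >= 1

# For a transformers pipeline, check which device it actually runs on;
# a CPU device here would explain the slow inference.
# print(model.device)               # hypothetical handle to the loaded pipeline
```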
Following is the content of the `bento.yaml` file:
```yaml
service: "src.service:svc"
labels:
   owner: plaetos
   stage: dev
include:
- "service.py"
- "config.py"
python:
    requirements_txt: "./requirements.txt"
docker:
    python_version: "3.7"
    cuda_version: "11.2"
```
and the `requirements.txt`:
```
-f https://download.pytorch.org/whl/cu113
torch==1.12.0
torchvision==0.13.0
torchaudio==0.12.0
pandas==1.1.4
transformers[sentencepiece]==4.23.1
jax==0.3.24
flax==0.3.3
minio
python-dotenv
```
Following is the snippet from `service.py` containing the runner instance configuration:
```python
import bentoml
from bentoml.io import JSON
import torch
from .config import MODEL_NAME, MODEL_VERSION, SERVICE_NAME
import pandas as pd
# from bentoml.io import PandasDataFrame

class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.model = bentoml.transformers.load_model(sentiment_model)
        print(f"self.model: {self.model}")

    @bentoml.Runnable.method(batchable=True)
    def __call__(self, input_text):
        return self.model(input_text)


sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")

sentiment_runner = bentoml.Runner(
    SentimentRunnable,
    name="sentimentrunner_v1",
    models=[sentiment_model],
)
```
I should also state that I only used the `BentoRequest` CR to create the bento image, but didn't use the `BentoDeployment` CR, as we are using Knative to scale containers down to zero. Following is the Knative service spec for reference, to verify the args passed:
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sentiment-3grade-service-1-bbd-12
  namespace: default
spec:
  template:
    spec:
      containers:
      - args:
        - --api-workers=1
        - --production
        - --port=5000
        env:
        - name: MODEL_NAME
          value: sentiment-3grade-model-v1
        - name: MODEL_VERSION
          value: latest
        - name: SERVICE_NAME
          value: sentiment-3grade-service
        - name: GPU_ENABLED
          value: 'true'
        image: <path to image created by yatai-image-builder>
        livenessProbe:
          httpGet:
            path: /healthz
          initialDelaySeconds: 3
          periodSeconds: 5
        ports:
        - containerPort: 5000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
          initialDelaySeconds: 3
          periodSeconds: 5
          timeoutSeconds: 300
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - effect: NoSchedule
        key: gpu
        operator: Equal
        value: "true"
```
Jiang
Hi @Ghawady Ehmaid

https://files.slack.com/files-pri/TK999HSJU-F04J9M1SCCE/image.png

Is this container the API server or the runner `sentimentrunner_v1`?
Ghawady Ehmaid
Hi @Jiang, there is actually only one container at the moment, so both the API server and the runner are running in one container. It is currently this way because we tweaked the existing script we had while running BentoML v0.10, which didn't have the runner concept. If you could guide me through what args to pass to start only the runner, or only the API server, so we can separate them into different containers, that would be great!
btw, this is the log from that single container; it shows both the API server and runner logs, if useful.
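(For reference, BentoML's `start` subcommands are what distributed deployments use to run the API server and the runners as separate processes; a sketch, with flags worth verifying against your BentoML version:)

```sh
# Container 1: run only the runner (hypothetical bento tag, host, and port)
bentoml start-runner-server sentiment_service:latest \
    --runner-name sentimentrunner_v1 --port 3001

# Container 2: run only the API server, pointing at the remote runner
bentoml start-http-server sentiment_service:latest \
    --remote-runner sentimentrunner_v1=http://runner-host:3001 \
    --port 5000
```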
Jiang
Hi @Ghawady Ehmaid. The entry point and supervisor seem correct.
```python
import bentoml
from bentoml.io import JSON
import torch
from .config import MODEL_NAME, MODEL_VERSION, SERVICE_NAME
import pandas as pd
# from bentoml.io import PandasDataFrame

class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.model = bentoml.transformers.load_model(sentiment_model)
        print(f"self.model: {self.model}")

    @bentoml.Runnable.method(batchable=True)
    def __call__(self, input_text):
        return self.model(input_text)


sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")

sentiment_runner = bentoml.Runner(
    SentimentRunnable,
    name="sentimentrunner_v1",
    models=[sentiment_model],
)
```
This custom runner doesn't support GPU inference as written; the model has to be loaded onto the GPU explicitly:
```python
class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        if torch.cuda.is_available():
            # load the transformers pipeline onto the first GPU
            self.model = bentoml.transformers.load_model(sentiment_model, device=0)
        else:
            # fall back to CPU
            self.model = bentoml.transformers.load_model(sentiment_model, device=-1)
        print(f"self.model: {self.model}")
```
πŸ™Œ 1
It should be like this.
πŸ‘ 1
Ghawady Ehmaid
Ohh thank you for pointing this out! I wouldn't have known πŸ™Œ
Jiang
May I ask why you don't use the `bentoml.transformers` runner?
Ghawady Ehmaid
My answer would be naively stating that we followed the example in the documentation. I saw that with the custom runner example we can set `SUPPORTED_RESOURCES`, and thought that was how it needed to be done πŸ˜… https://docs.bentoml.org/en/latest/frameworks/transformers.html#pretrained-models
Jiang
@Ghawady Ehmaid I see. Thanks, this is very informative for us as we improve our docs.
🍱 1
Ghawady Ehmaid
Again, thanks for asking this question. I am having a look at the `TransformersRunnable` class and don't see a reason not to use it. This is better than creating a custom runner that does the same thing.
🍻 1
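(For reference, using the framework-provided runner instead of the custom runnable would look roughly like the sketch below; it assumes the same model tag variables, and recent BentoML versions handle device placement for the built-in transformers runnable based on the resources allocated to it, which is worth verifying for your version:)

```python
import bentoml
from .config import MODEL_NAME, MODEL_VERSION

# Fetch the saved transformers model and create the framework-provided runner
sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")
sentiment_runner = sentiment_model.to_runner()

# The runner plugs into the service the same way the custom one did
svc = bentoml.Service("sentiment_service", runners=[sentiment_runner])
```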