# ask-for-help
Xipeng Guan
Now we can manually create a `BentoRequest` CR to trigger the image builder. Yes, now we can set the `spec.imageBuilderExtraPodSpec.affinity` of the `BentoRequest` CR to specify the node affinity for the image builder pod.
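(For illustration, a minimal `BentoRequest` with a node-affinity constraint might look like the sketch below; the `resources.yatai.ai/v1alpha1` API version, the names, and the bento tag are assumptions to verify against your yatai-image-builder version:)

```yaml
apiVersion: resources.yatai.ai/v1alpha1
kind: BentoRequest
metadata:
  name: sentiment-bento                  # hypothetical name
  namespace: yatai
spec:
  bentoTag: sentiment_service:latest     # hypothetical bento tag pushed to Yatai
  imageBuilderExtraPodSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/arch    # pin the builder pod via a node label
              operator: In
              values:
              - amd64
```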
Ghawady Ehmaid
I see, so now the steps to deploy with Yatai would be:
1. `bentoml build`
2. `bentoml push $bentoname`
3. create a `BentoRequest` CR and run `kubectl apply`
4. create a `BentoDeployment` CR and run `kubectl apply`
Is that correct?
Xipeng Guan
Yes, those are the right steps.
πŸ‘ 1
Ghawady Ehmaid
Hi @Xipeng Guan, if possible I need your help to investigate why the GPU is not being used during inference. I can see that the container is starting on the required node, with the required GPU device attached and accessible within the container. But from the logs, the memory allocation, and the time inference takes, it is clear that the GPU is not utilized during inference. Not sure where I am getting it wrong.
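(A quick way to verify GPU visibility from inside the container is a sketch like this; the pipeline handle in the last line is hypothetical:)

```python
import torch

# Does PyTorch see the GPU inside the container at all?
print(torch.cuda.is_available())    # expect True
print(torch.cuda.device_count())    # expect >= 1

# For a transformers pipeline, check which device it actually runs on;
# a CPU device here would explain the slow inference.
# print(model.device)               # hypothetical handle to the loaded pipeline
```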
Following is the content of the `bento.yaml` file:
```yaml
service: "src.service:svc"
labels:
   owner: plaetos
   stage: dev
include:
- "service.py"
- "config.py"
python:
    requirements_txt: "./requirements.txt"
docker:
    python_version: "3.7"
    cuda_version: "11.2"
```
and the `requirements.txt`:
```
-f https://download.pytorch.org/whl/cu113
torch==1.12.0
torchvision==0.13.0
torchaudio==0.12.0
pandas==1.1.4
transformers[sentencepiece]==4.23.1
jax==0.3.24
flax==0.3.3
minio
python-dotenv
```
Following is the snippet from `service.py` containing the runner instance configuration:
```python
import bentoml
from bentoml.io import JSON
import torch
from .config import MODEL_NAME, MODEL_VERSION, SERVICE_NAME
import pandas as pd
# from bentoml.io import PandasDataFrame

class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.model = bentoml.transformers.load_model(sentiment_model)
        print(f"self.model: {self.model}")

    @bentoml.Runnable.method(batchable=True)
    def __call__(self, input_text):
        return self.model(input_text)


sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")

sentiment_runner = bentoml.Runner(
    SentimentRunnable,
    name="sentimentrunner_v1",
    models=[sentiment_model],
)
```
I should also state that I only used the `BentoRequest` CR to create the bento image, but didn't use the `BentoDeployment` CR, as we are using Knative to scale containers down to zero. Following is the Knative service spec for reference, to verify the args passed:
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sentiment-3grade-service-1-bbd-12
  namespace: default
spec:
  template:
    spec:
      containers:
      - args:
        - --api-workers=1
        - --production
        - --port=5000
        env:
        - name: MODEL_NAME
          value: sentiment-3grade-model-v1
        - name: MODEL_VERSION
          value: latest
        - name: SERVICE_NAME
          value: sentiment-3grade-service
        - name: GPU_ENABLED
          value: 'true'
        image: <path to image created by yatai-image-builder>
        livenessProbe:
          httpGet:
            path: /healthz
          initialDelaySeconds: 3
          periodSeconds: 5
        ports:
        - containerPort: 5000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
          initialDelaySeconds: 3
          periodSeconds: 5
          timeoutSeconds: 300
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - effect: NoSchedule
        key: gpu
        operator: Equal
        value: "true"
```
Jiang
Hi @Ghawady Ehmaid

https://files.slack.com/files-pri/TK999HSJU-F04J9M1SCCE/image.png

Is this container the API server or the runner `sentimentrunner_v1`?
Ghawady Ehmaid
Hi @Jiang, there is actually only one container at the moment, so both the API server and the runner are running in one container. It is currently this way because we tweaked the existing script we had while running BentoML v0.10, which didn't have the runner concept. If you could guide me through what args to pass to start only the runner, or only the API server, so we can separate them into different containers, that would be great!
btw, this is the log from that single container; it shows both the API server and runner logs, if useful.
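(For reference, BentoML's `start` subcommands are what distributed deployments use to run the API server and the runners as separate processes; a sketch, with flags worth verifying against your BentoML version:)

```sh
# Container 1: run only the runner (hypothetical bento tag, host, and port)
bentoml start-runner-server sentiment_service:latest \
    --runner-name sentimentrunner_v1 --port 3001

# Container 2: run only the API server, pointing at the remote runner
bentoml start-http-server sentiment_service:latest \
    --remote-runner sentimentrunner_v1=http://runner-host:3001 \
    --port 5000
```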
Jiang
Hi @Ghawady Ehmaid. The entry point and supervisor seem correct.
```python
import bentoml
from bentoml.io import JSON
import torch
from .config import MODEL_NAME, MODEL_VERSION, SERVICE_NAME
import pandas as pd
# from bentoml.io import PandasDataFrame

class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.model = bentoml.transformers.load_model(sentiment_model)
        print(f"self.model: {self.model}")

    @bentoml.Runnable.method(batchable=True)
    def __call__(self, input_text):
        return self.model(input_text)


sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")

sentiment_runner = bentoml.Runner(
    SentimentRunnable,
    name="sentimentrunner_v1",
    models=[sentiment_model],
)
```
This custom runner doesn't support GPU inference as written; the model has to be loaded onto the GPU explicitly:
```python
class SentimentRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        if torch.cuda.is_available():
            # load the transformers pipeline onto the first GPU
            self.model = bentoml.transformers.load_model(sentiment_model, device=0)
        else:
            # fall back to CPU
            self.model = bentoml.transformers.load_model(sentiment_model, device=-1)
        print(f"self.model: {self.model}")
```
πŸ™Œ 1
It should be like this.
πŸ‘ 1
Ghawady Ehmaid
Ohh thank you for pointing this out! I wouldn't have known πŸ™Œ
Jiang
May I ask why you don't use the `bentoml.transformers` runner?
Ghawady Ehmaid
My answer would be naively stating that we followed the example in the documentation. I saw that with the custom runner example we can set `SUPPORTED_RESOURCES`, and thought that was how it needed to be done πŸ˜… https://docs.bentoml.org/en/latest/frameworks/transformers.html#pretrained-models
Jiang
@Ghawady Ehmaid I see. Thanks, this is very informative for us as we improve our docs.
🍱 1
Ghawady Ehmaid
Again, thanks for asking this question. I am having a look at the `TransformersRunnable` class and don't see a reason not to use it. This is better than creating a custom runner that does the same thing.
🍻 1
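(For reference, using the framework-provided runner instead of the custom runnable would look roughly like the sketch below; it assumes the same model tag variables, and recent BentoML versions handle device placement for the built-in transformers runnable based on the resources allocated to it, which is worth verifying for your version:)

```python
import bentoml
from .config import MODEL_NAME, MODEL_VERSION

# Fetch the saved transformers model and create the framework-provided runner
sentiment_model = bentoml.transformers.get(f"{MODEL_NAME}:{MODEL_VERSION}")
sentiment_runner = sentiment_model.to_runner()

# The runner plugs into the service the same way the custom one did
svc = bentoml.Service("sentiment_service", runners=[sentiment_runner])
```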