# ask-for-help
c
Hi @Yilun Zhang, do you want to test just the runner itself? How did you set up the tests?
If you are calling `init_local` from a standalone Python process, the default runner implementation in BentoML will respect the `CUDA_VISIBLE_DEVICES` env var and load the model onto the GPU if a GPU is available
If you are using a custom runner, then it will depend on the runner’s implementation
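For example, a quick standalone test could look roughly like this (just a sketch; the model tag, device index, and input text are placeholders, not from your setup):
```python
import os
import bentoml

# Placeholder device index; make the target GPU visible before the model loads.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Placeholder tag for a saved transformers pipeline in the model store.
runner = bentoml.transformers.get("my_model:latest").to_runner()
runner.init_local()  # for local debugging only; loads the model in this process
print(runner.run("BentoML makes ML model serving easy."))
```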
y
Ohh perfect, it's working. Yeah, I was just using a `bentoml.transformers` pipeline and didn't know that I need to specify the `CUDA_VISIBLE_DEVICES` variable explicitly, since by default all GPUs are usually visible. However, after this testing I noticed that for my token classification model I get ~5ms runtime when using the pipeline directly (in transformers) and when testing in a locally initialized runner (with GPU). But when I spin up the bentoml server with
```
BENTOML_CONFIG=config_bentoml/bentoml_configuration_dev.yaml bentoml serve service.py:svc
```
and the same runner, it's taking ~50ms even though I instruct the runner to use the GPU in the config.
```yaml
my_runner:
    resources:
      cpu: 1
      nvidia.com/gpu: [6, 6]
    batching:
      enabled: False
      max_batch_size: 64
      max_latency_ms: 20000
```
I can confirm that the runner names match. I also disabled batching, so it shouldn't be spending time waiting for other inputs. This time difference made me wonder whether the runner is really assigned to the GPU or not. Is there a way to confirm this by checking the location of the model during a request?
c
The easiest way is probably using `nvidia-smi` to check if the model is loaded to the GPU
how did you measure the 50ms latency?
is it the overall inference time for the request?
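If you want a check from Python, loading the pipeline the same way the runner does and inspecting its device also works (rough sketch; the tag is a placeholder):
```python
import bentoml

# Placeholder tag; loads the saved transformers pipeline from the model store.
pipe = bentoml.transformers.load_model("my_model:latest")
print(pipe.device)                            # e.g. device(type='cuda', index=0) or 'cpu'
print(next(pipe.model.parameters()).device)   # device of the underlying torch weights
```
Keep in mind this only tells you about a copy loaded in your own process; for the runner process started by `bentoml serve`, `nvidia-smi` is the reliable check.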
y
I measure the total elapsed time around `runner.run()`:
```python
import time

t0 = time.perf_counter()
result = runner.run(data)
t1 = time.perf_counter()
print(f"runner.run took {(t1 - t0) * 1000:.1f} ms")
```
So I guess there's no way to check this from the runner object. I will check tomorrow to see if the GPU is assigned correctly for the runner.
@Chaoyu So I checked whether the runner is on the GPU or not, and indeed it's not moved to the GPU. I think I have a very basic setup here:
```python
import bentoml

my_model = bentoml.transformers.get(my_model_name)
my_runner = my_model.to_runner(name="my_runner")
```
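(For context, the service wiring around that runner is basically just the following; a simplified sketch where the service name, IO descriptors, and the literal model tag are placeholders.)
```python
# Simplified service.py sketch; names and IO descriptors are illustrative only.
import bentoml
from bentoml.io import JSON, Text

my_model = bentoml.transformers.get("my_model:latest")
my_runner = my_model.to_runner(name="my_runner")

svc = bentoml.Service("my_service", runners=[my_runner])

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    return await my_runner.async_run(text)
```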
Then in my config, I have:
```yaml
runners:
  my_runner:
    resources:
      cpu: 1
      nvidia.com/gpu: [6, 6]
    batching:
      enabled: False
      max_batch_size: 64
      max_latency_ms: 20000
```
Do you spot any obvious error here?
Note, it's interesting that all the runners from my custom runnable class are running on the GPU with no problem (they don't use a pipeline, though); it's only this pipeline model that isn't running on the GPU correctly (when configured via the bentoml config file).
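For reference, those custom runnables look roughly like the sketch below (names, tag, and device handling are placeholders rather than the exact code); they move the weights onto the GPU explicitly instead of relying on `CUDA_VISIBLE_DEVICES`:
```python
import bentoml
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class MyCustomRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Placeholder tag; assumes the stored model directory is in HF format.
        model_path = bentoml.models.get("my_custom_model:latest").path
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # Move the weights to the GPU explicitly instead of relying on
        # CUDA_VISIBLE_DEVICES being set for the runner process.
        self.model = AutoModelForTokenClassification.from_pretrained(model_path).to(self.device)
        self.model.eval()

    @bentoml.Runnable.method(batchable=False)
    def predict(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1).tolist()


my_custom_runner = bentoml.Runner(MyCustomRunnable, name="my_custom_runner")
```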
@Chaoyu I think I found the issue. In the class definition of `TransformersRunnable`, you are explicitly checking for `CUDA_VISIBLE_DEVICES` and ignoring the GPU if it isn't set. But when I spin up services, I use the bentoml config file to set which runner goes to which GPU device, so I don't need to specify this environment variable, and I feel like there's some conflict here (maybe it's designed this way? I'm not sure). It would be nicer if the init code picked up the GPU device when `CUDA_VISIBLE_DEVICES` is not specified. I get an error when specifying the GPU in the config and/or `CUDA_VISIBLE_DEVICES` (not sure what the error is, since there's no debug information even with `--debug`; the service keeps restarting). When I specify `CUDA_VISIBLE_DEVICES=6`, the log shows that the device arg is passed:
```
2023-02-07 14:26:10,343 - bentoml._internal.frameworks.transformers - INFO - Loading 'token-classification' pipeline 'my_model:latest' with kwargs {'device': 6}.
```
But probably there’s some error happening causing the service to restart repeatedly.
Maybe I can submit a bug on GitHub later with a fully reproducible code sample for this issue.
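If it helps, a minimal repro could probably start by saving a small public token-classification pipeline like this (placeholder model, not the actual one) and then serving it with the config above:
```python
# Possible starting point for a minimal reproduction; the HF model here is
# just a small public token-classification model, not the one from this thread.
import bentoml
from transformers import pipeline

pipe = pipeline("token-classification", model="dslim/bert-base-NER")
bentoml.transformers.save_model("my_model", pipe)
```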