# ask-for-help
c
Hi @Yilun Zhang, do you want to test just the runner itself? How did you set up the tests?
If you are calling `init_local` from a standalone Python process, the default runner implementation in BentoML will respect the `CUDA_VISIBLE_DEVICES` env var and load the model onto the GPU if a GPU is available
If you are using a custom runner, then it will depend on the runner’s implementation
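For example, a quick standalone test could look roughly like this (just a sketch; the model tag, device index, and input text are placeholders, not from your setup):
```python
import os
import bentoml

# Placeholder device index; make the target GPU visible before the model loads.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Placeholder tag for a saved transformers pipeline in the model store.
runner = bentoml.transformers.get("my_model:latest").to_runner()
runner.init_local()  # for local debugging only; loads the model in this process
print(runner.run("BentoML makes ML model serving easy."))
```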
y
Ohh perfect, it's working. Yeah, I was just using a `bentoml.transformers` pipeline and didn't know that I need to specify the `CUDA_VISIBLE_DEVICES` variable explicitly, since by default all GPUs are usually visible. However, after this testing I noticed that for my token classification model I get ~5ms runtime when using the pipeline directly (in transformers) and when testing in a locally initialized runner (with GPU). But when I spin up the bentoml server with
```
BENTOML_CONFIG=config_bentoml/bentoml_configuration_dev.yaml bentoml serve service.py:svc
```
and the same runner, it's taking ~50ms even though I instruct the runner to use the GPU in the config.
```yaml
my_runner:
    resources:
      cpu: 1
      nvidia.com/gpu: [6, 6]
    batching:
      enabled: False
      max_batch_size: 64
      max_latency_ms: 20000
```
I can confirm that the runner names match. I also disabled batching, so it shouldn't be spending time waiting for other inputs. This time difference made me wonder whether the runner is really assigned to the GPU or not. Is there a way to confirm this by checking the location of the model during a request?
c
The easiest way is probably using `nvidia-smi` to check if the model is loaded to the GPU
how did you measure the 50ms latency?
is it the overall inference time for the request?
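If you want a check from Python, loading the pipeline the same way the runner does and inspecting its device also works (rough sketch; the tag is a placeholder):
```python
import bentoml

# Placeholder tag; loads the saved transformers pipeline from the model store.
pipe = bentoml.transformers.load_model("my_model:latest")
print(pipe.device)                            # e.g. device(type='cuda', index=0) or 'cpu'
print(next(pipe.model.parameters()).device)   # device of the underlying torch weights
```
Keep in mind this only tells you about a copy loaded in your own process; for the runner process started by `bentoml serve`, `nvidia-smi` is the reliable check.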
y
I measure the total elapsed time around `runner.run()`:
```python
import time

t0 = time.perf_counter()
result = runner.run(data)
t1 = time.perf_counter()
print(f"runner.run took {(t1 - t0) * 1000:.1f} ms")
```
So I guess there's no way to check this from the runner object. I will check tomorrow to see if the GPU is assigned correctly for the runner.
@Chaoyu So I checked whether the runner is on the GPU or not, and indeed it's not moved to the GPU. I think I have a very basic setup here:
```python
import bentoml

my_model = bentoml.transformers.get(my_model_name)
my_runner = my_model.to_runner(name="my_runner")
```
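(For context, the service wiring around that runner is basically just the following; a simplified sketch where the service name, IO descriptors, and the literal model tag are placeholders.)
```python
# Simplified service.py sketch; names and IO descriptors are illustrative only.
import bentoml
from bentoml.io import JSON, Text

my_model = bentoml.transformers.get("my_model:latest")
my_runner = my_model.to_runner(name="my_runner")

svc = bentoml.Service("my_service", runners=[my_runner])

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    return await my_runner.async_run(text)
```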
Then in my config, I have:
```yaml
runners:
  my_runner:
    resources:
      cpu: 1
      nvidia.com/gpu: [6, 6]
    batching:
      enabled: False
      max_batch_size: 64
      max_latency_ms: 20000
```
Do you spot any obvious error here?
Note, it's interesting that all the runners from my custom runnable class are running on the GPU with no problem (they don't use a pipeline, though); it's only this pipeline model that isn't running on the GPU correctly (when configured via the bentoml config file).
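For reference, those custom runnables look roughly like the sketch below (names, tag, and device handling are placeholders rather than the exact code); they move the weights onto the GPU explicitly instead of relying on `CUDA_VISIBLE_DEVICES`:
```python
import bentoml
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class MyCustomRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Placeholder tag; assumes the stored model directory is in HF format.
        model_path = bentoml.models.get("my_custom_model:latest").path
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # Move the weights to the GPU explicitly instead of relying on
        # CUDA_VISIBLE_DEVICES being set for the runner process.
        self.model = AutoModelForTokenClassification.from_pretrained(model_path).to(self.device)
        self.model.eval()

    @bentoml.Runnable.method(batchable=False)
    def predict(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1).tolist()


my_custom_runner = bentoml.Runner(MyCustomRunnable, name="my_custom_runner")
```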
@Chaoyu I think I found the issue. In the class definition of `TransformersRunnable`, you are explicitly checking for `CUDA_VISIBLE_DEVICES` and ignoring the GPU if it isn't set. But when I spin up services, I use the bentoml config file to set which runner goes to which GPU device, so I don't need to specify this environment variable, and I feel like there's some conflict here (maybe it's designed this way? I'm not sure). It would be nicer if the init code picked up the GPU device when `CUDA_VISIBLE_DEVICES` is not specified. I get an error when specifying the GPU in the config and/or `CUDA_VISIBLE_DEVICES` (not sure what the error is, since there's no debug information even with `--debug`; the service keeps restarting). When I specify `CUDA_VISIBLE_DEVICES=6`, the log shows that the device arg is passed:
```
2023-02-07 14:26:10,343 - bentoml._internal.frameworks.transformers - INFO - Loading 'token-classification' pipeline 'my_model:latest' with kwargs {'device': 6}.
```
But probably there’s some error happening causing the service to restart repeatedly.
Maybe I can submit a bug on GitHub later with a fully reproducible code sample for this issue.
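If it helps, a minimal repro could probably start by saving a small public token-classification pipeline like this (placeholder model, not the actual one) and then serving it with the config above:
```python
# Possible starting point for a minimal reproduction; the HF model here is
# just a small public token-classification model, not the one from this thread.
import bentoml
from transformers import pipeline

pipe = pipeline("token-classification", model="dslim/bert-base-NER")
bentoml.transformers.save_model("my_model", pipe)
```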