Slackbot
02/06/2023, 10:29 PM

Chaoyu
02/06/2023, 10:37 PM

Chaoyu
02/06/2023, 10:38 PM
If you init_local from a standalone python process, the default runner implementation in BentoML will respect the CUDA_VISIBLE_DEVICES env var and load the model to the GPU if a GPU is available.
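(As a minimal sketch of that standalone check; the model tag and input are placeholders, and CUDA_VISIBLE_DEVICES has to be set before the model is loaded:)

import os
import bentoml

# Expose a specific GPU before the runner loads the model; the default
# runner implementation reads CUDA_VISIBLE_DEVICES at load time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

runner = bentoml.transformers.get("my_model:latest").to_runner()
runner.init_local()  # loads the model in-process, intended for debugging/testing
print(runner.run("some example text"))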

Chaoyu
02/06/2023, 10:39 PM

Yilun Zhang
02/06/2023, 10:45 PM
I'm using a bentoml.transformers
pipeline, I just didn't know that I need to specify the CUDA_VISIBLE_DEVICES variable explicitly, since usually the default is that all GPUs are visible.
However, after this testing I noticed that for my token classification model I get ~5ms runtime when using the pipeline directly (in transformers) and when testing in a locally initialized runner (with GPU). But when I spin up the bentoml server with BENTOML_CONFIG=config_bentoml/bentoml_configuration_dev.yaml bentoml serve service.py:svc
with the same runner, it takes ~50ms even though I instruct the runner to use the GPU in the config:
my_runner:
  resources:
    cpu: 1
    nvidia.com/gpu: [6, 6]
  batching:
    enabled: False
    max_batch_size: 64
    max_latency_ms: 20000
I can confirm that the runner names are matched. I also disabled batching, so it shouldn't be spending time waiting for other future inputs.
This time difference made me wonder whether the runner is really assigned to the GPU or not. Is there a way to confirm this by checking the location of the model during a request?
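(One rough way to check this outside the serving process, assuming a PyTorch-backed pipeline; the model tag is a placeholder:)

import bentoml

pipe = bentoml.transformers.load_model("my_model:latest")
# A transformers pipeline records the device it was loaded on; for a
# PyTorch model the parameters' device gives the same answer.
print(pipe.device)                            # e.g. cuda:0 vs cpu
print(next(pipe.model.parameters()).device)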

Chaoyu
02/07/2023, 12:13 AM

Chaoyu
02/07/2023, 12:13 AM

Chaoyu
02/07/2023, 12:13 AM

Yilun Zhang
02/07/2023, 12:57 AM
import time

t0 = time.perf_counter()
result = runner.run(data)
t1 = time.perf_counter()
print(f"runner.run took {(t1 - t0) * 1000:.1f} ms")

Yilun Zhang
02/07/2023, 1:18 AM

Yilun Zhang
02/07/2023, 1:43 AM
my_model = bentoml.transformers.get(my_model_name)
my_runner = my_model.to_runner(name="my_runner")
Then in my config, I have:
runners:
  my_runner:
    resources:
      cpu: 1
      nvidia.com/gpu: [6, 6]
    batching:
      enabled: False
      max_batch_size: 64
      max_latency_ms: 20000
Do you spot any obvious error here?
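(For reference, a minimal service.py this setup could pair with; the service name, API signature, and model tag are illustrative, but the runner name has to match the key under runners: in the config:)

import bentoml
from bentoml.io import JSON, Text

my_model = bentoml.transformers.get("my_model:latest")
# The name given here is what the config's runners section keys on.
my_runner = my_model.to_runner(name="my_runner")

svc = bentoml.Service("token_classification", runners=[my_runner])

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    return await my_runner.async_run(text)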

Yilun Zhang
02/07/2023, 3:10 AM

Yilun Zhang
02/07/2023, 1:59 PM
It looks like the init code respects CUDA_VISIBLE_DEVICES and will ignore the GPU if it isn't set. But when I spin up services, I use the bentoml config file to set which runner goes to which GPU device, so I don't need to specify this environment variable. And I feel like there's some conflict here (maybe it's designed this way? I'm not sure). It would be nicer if the init code would pick up the GPU device when CUDA_VISIBLE_DEVICES is not specified.
I get an error when specifying and/or CUDA_VISIBLE_DEVICES (I'm not sure what the error is, since there's no debug information even with --debug). The service keeps restarting.
When I specify CUDA_VISIBLE_DEVICES=6, it shows in the log that the device arg is passed:
2023-02-07 14:26:10,343 - bentoml._internal.frameworks.transformers - INFO - Loading 'token-classification' pipeline 'my_model:latest' with kwargs {'device': 6}.
But probably there's some error happening that causes the service to restart repeatedly.
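(One thing that might be worth checking here, purely an assumption about the restart cause: CUDA_VISIBLE_DEVICES renumbers the visible GPUs starting from 0, so with CUDA_VISIBLE_DEVICES=6 only one device is exposed and a device index of 6 would be out of range. A quick check:)

import torch

# With CUDA_VISIBLE_DEVICES=6 set, only one GPU is visible and it is index 0.
print(torch.cuda.device_count())       # expected: 1
print(torch.cuda.get_device_name(0))   # the physical GPU 6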

Yilun Zhang
02/07/2023, 5:30 PM