Yilun Zhang  03/20/2023, 9:42 PM
The transformers pipeline has to be moved to a specific device cuda:[X], where X is the GPU id, and that's not something I can pass in my service.py file when I want to control it through the yaml config file. But if I don't specify moving the pipeline to GPU in my service.py file, it stays on CPU.
This is not an issue when the tokenizer and model objects are separate; I can do something like:
```python
model.to("cuda")
batch = tokenizer(inputs).to("cuda")
output = model.generate(**batch, ...)
```
But with the transformers pipeline above not working on GPU, I'm not certain whether this code is working correctly either. To be more specific: if I have multiple instances of the model runner above distributed across GPUs, are requests routed correctly to all of them?
Having information on which runner processed which input query would give me more confidence that multiple runners are initialized, moved to GPUs correctly, and fully utilized, rather than some runners just dangling there doing nothing without my knowledge.
Or are there other existing ways of knowing the request volume for each runner?
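A minimal sketch of one way to get that per-request visibility, assuming a standard Python logging setup; the helper name is illustrative, not a BentoML API:
```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def log_worker_identity(batch_size: int) -> None:
    """Log which worker process and CUDA device handled a batch.

    Hypothetical helper: call it at the top of the runner's predict
    method to confirm every runner instance is actually receiving traffic.
    """
    device_id = torch.cuda.current_device() if torch.cuda.is_available() else None
    logger.info(
        "runner worker pid=%d cuda_device=%s batch_size=%d",
        os.getpid(),
        device_id,
        batch_size,
    )
```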

Sean  03/21/2023, 9:21 AM
> Or are there other existing ways of knowing the request volume for each runner?
Runners report metrics on the /metrics endpoint, just like the API Server. If you have Prometheus set up, you can collect metrics from the runners to understand their traffic volumes.
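As a quick check without a full Prometheus setup, one could scrape the metrics endpoint directly; a sketch, where the URL is an assumption that depends on how the server and runners are deployed:
```python
import urllib.request

# Assumed address: a locally served BentoML app exposes Prometheus-format
# metrics at /metrics (adjust host/port for your deployment).
METRICS_URL = "http://localhost:3000/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode("utf-8")

# Print only request-related samples to eyeball traffic volumes.
for line in body.splitlines():
    if "request" in line and not line.startswith("#"):
        print(line)
```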

Sean  03/21/2023, 9:39 AM
> To be more specific: if I have multiple instances of the model runner above distributed across GPUs, are requests routed correctly to all of them?
Request routing is independent of token locality on GPU. In the code sample you shared, if the logic all runs in the same custom runner, we can ensure that the tokens and the model are on the same GPU. On the other hand, if you'd like to perform the tokenization in the API Server on CPU and pass the tokens to a custom runner on GPU, you will make the call to the runner before moving the tokens to GPU. In the custom runner, you'll have to move the tokens to GPU before calling the model for inference.
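A sketch of the second pattern Sean describes (tokenize on CPU in the API Server, move tensors to GPU inside the runner); the model name and all identifiers here are illustrative, not from this thread:
```python
import bentoml
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "t5-small"  # assumption: any seq2seq model works the same way


class GpuModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(self.device)

    @bentoml.Runnable.method(batchable=False)
    def generate(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Tokens arrive from the API Server on CPU; move them to this
        # runner's GPU before inference, as described above.
        input_ids = input_ids.to(self.device)
        attention_mask = attention_mask.to(self.device)
        with torch.no_grad():
            output = self.model.generate(input_ids=input_ids, attention_mask=attention_mask)
        return output.cpu()


# service.py side: tokenization stays on CPU in the API Server process.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
runner = bentoml.Runner(GpuModelRunnable, name="gpu_model_runner")
svc = bentoml.Service("seq2seq_demo", runners=[runner])


@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def predict(text: str) -> str:
    batch = tokenizer(text, return_tensors="pt")  # CPU tensors here
    output_ids = await runner.generate.async_run(batch["input_ids"], batch["attention_mask"])
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```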

Yilun Zhang  03/21/2023, 3:05 PM
> Runners report metrics on the /metrics endpoint, just like the API Server. If you have Prometheus set up, you can collect metrics from the runners to understand their traffic volumes.
Thanks for the direction! I have never actively checked /metrics information, maybe it's time for me to do so for better monitoring!

> In the custom runner, you'll have to move the tokens to GPU before calling the model for inference.
Yes, that's exactly what I do. A more complete sample code for my custom runner class (this one is just a pytorch model runner but same idea) will be:
```python
import bentoml
import torch


class CustomRunner(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        self.torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        bento_model = bentoml.pytorch.get(model_name)
        self.tokenizer = bento_model.custom_objects["tokenizer"]
        self.model = bentoml.pytorch.load_model(bento_model).to(self.torch_device)

    def predict(self, batch):
        src_text = [obj.get("text") for obj in batch]
        tokenized = self.tokenizer(src_text, ..., return_tensors="pt").to(self.torch_device)
        self.model(**tokenized)
        ...
```
So basically, if my yaml config says I want one instance of this runner on GPU 0 and another on GPU 1, how will .to(self.torch_device) (whose value is just "cuda") know which GPU to move the tensors to?
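One assumption worth checking here, not confirmed in this thread: if the framework assigns GPUs to runner workers by setting CUDA_VISIBLE_DEVICES per process, then a bare "cuda" inside each worker already resolves to that worker's assigned GPU. A sketch that reports what a given process actually sees:
```python
import os

import torch


def report_visible_gpus() -> None:
    """Print what torch.device("cuda") resolves to in this process.

    If a worker is started with CUDA_VISIBLE_DEVICES=1, torch sees exactly
    one device whose local index is 0 but which is physical GPU 1, so
    .to("cuda") lands on the intended card.
    """
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("visible device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("device 0 name:", torch.cuda.get_device_name(0))
```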

Yilun Zhang  03/21/2023, 3:06 PM
And for the transformers setup, the pipeline has to be moved to a specific GPU device cuda:x rather than simply cuda, so the same setup will just fail, since I won't be able to move the pipeline to the GPU devices specified in the yaml config.
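A possible workaround sketch, not from the thread: give the pipeline an explicit device at construction time, derived from configuration, rather than a bare "cuda". The env-var name and model are illustrative assumptions:
```python
import os

import torch
from transformers import pipeline

# Assumption: the target GPU index is supplied by the deployment,
# e.g. through an environment variable, instead of being hard-coded.
gpu_index = int(os.environ.get("TARGET_GPU_INDEX", "0"))
device = torch.device(f"cuda:{gpu_index}" if torch.cuda.is_available() else "cpu")

# transformers pipelines accept a torch.device (or an int index) directly,
# so the pipeline can be pinned to cuda:X when it is created.
pipe = pipeline("text2text-generation", model="t5-small", device=device)
print(pipe("translate English to German: Hello, world!"))
```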