# ask-for-help
b
I don't know, @Sean do you have a better answer for this? @Yilun Zhang curious, why do you want to know this info?
y
I have previously raised questions about hosting/moving transformers models/pipeline objects to GPU device(s), and about using the bentoml configuration yaml file to specify the number of runners to create and distribute across multiple GPUs for a given model. The previous issue I had was that I could not move a transformers pipeline object to GPU because it requires `cuda:[X]`, where `X` is the GPU id, and that's not something I can pass in my service.py file when I want to control it through the yaml config file. But if I don't move the pipeline to GPU in my service.py file, it stays on CPU. This is not an issue when the tokenizer and model objects are kept separate, where I can do something like:
```python
model.to("cuda")
batch = tokenizer(inputs).to("cuda")
output = model.generate(**batch, ...)
```
But with the transformers pipeline above not working on GPU, I'm not very certain whether this code is working correctly or not. To be more specific, if I have multiple instances of the above model runner distributed across GPUs, are the requests routed correctly to all of them? Having information on which runner processed which input query would give me more confidence that multiple runners are initiated, moved to GPUs correctly, and fully utilized, rather than having some runners just dangling there doing nothing without me being aware of it. Or is there another existing way of knowing the request volume for each runner?
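Just to illustrate the kind of per-runner visibility I mean, even something as simple as this inside each runner would help (purely illustrative, not an existing BentoML feature):
```python
import os

import torch

def log_which_worker(batch_size: int) -> None:
    # Purely illustrative: record which process / CUDA device handled a
    # batch, so the logs show whether every runner instance actually
    # receives traffic.
    device = torch.cuda.current_device() if torch.cuda.is_available() else None
    print(f"pid={os.getpid()} cuda_device={device} batch_size={batch_size}")
```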
s
> Or is there another existing way of knowing the request volume for each runner?

Runners report metrics on the `/metrics` endpoint just like the API Server. If you have Prometheus set up, you can collect metrics from the runners to understand their traffic volumes.
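For a quick check without a full Prometheus setup, something like this works against a locally running server (assuming the default port 3000; adjust the URL to wherever your server or runners expose `/metrics`):
```python
import requests

# Fetch the Prometheus-format metrics from a locally running BentoML server.
# The host/port here are assumptions; adjust them to your deployment.
resp = requests.get("http://localhost:3000/metrics", timeout=5)
resp.raise_for_status()
for line in resp.text.splitlines():
    # The Prometheus text format uses '#' for HELP/TYPE comments; print
    # only the metric samples themselves.
    if line and not line.startswith("#"):
        print(line)
```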
> To be more specific, if I have multiple instances of the above model runner distributed across GPUs, are the requests routed correctly to all of them?

Request routing is independent of token locality on GPU. In the code sample you shared, if the logic all runs in the same custom runner, we can ensure that the tokens and the model are on the same GPU. On the other hand, if you'd like to perform the tokenization in the API Server on CPU and pass the tokens to a custom runner that runs on GPU, you will make the call to the runner before moving the tokens to GPU. In the custom runner, you'll have to move the tokens to GPU before calling the model for inference.
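A minimal sketch of that last step (the names here are placeholders): the runner, not the API Server, moves the CPU tensors onto its device right before calling the model.
```python
import torch

def run_on_gpu(model: torch.nn.Module, cpu_batch: dict):
    # The batch arrives from the API Server as CPU tensors; move it to this
    # runner's device before inference.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    gpu_batch = {k: v.to(device) for k, v in cpu_batch.items()}
    with torch.no_grad():
        return model(**gpu_batch)
```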
Let me know if I misunderstood your question.
@Aaron Pham ☝️ a Transformers use case that might be interesting to you.
y
@Sean
> Runners report metrics on the `/metrics` endpoint just like the API Server. If you have Prometheus set up, you can collect metrics from the runners to understand their traffic volumes.

Thanks for the direction! I have never actively checked the `/metrics` information; maybe it's time for me to do so for better monitoring!
> In the custom runner, you'll have to move the tokens to GPU before calling the model for inference.

Yes, that's exactly what I do. A more complete code sample for my custom runner class (this one is just a pytorch model runner, but same idea) would be:
```python
import bentoml
import torch


class CustomRunner(bentoml.Runnable):

    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        # Fall back to CPU when no GPU is available.
        self.torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        bento_model = bentoml.pytorch.get(model_name)  # model_name is defined elsewhere
        self.tokenizer = bento_model.custom_objects["tokenizer"]
        self.model = bentoml.pytorch.load_model(bento_model).to(self.torch_device)

    @bentoml.Runnable.method(batchable=True, batch_dim=0)  # registers predict() on the runner
    def predict(self, batch):
        src_text = [obj.get("text") for obj in batch]
        # Tokenize on CPU, then move the tensors to the runner's device.
        tokenized = self.tokenizer(src_text, ..., return_tensors="pt").to(self.torch_device)
        self.model(**tokenized)
        ...
```
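And for reference, this is roughly how I wire the Runnable above into a service (the runner/service names are placeholders; the actual GPU mapping then goes into the bentoml configuration yaml, which is the part I'm unsure about):
```python
import bentoml

# Sketch of the wiring only; my understanding is that the per-runner GPU
# assignment (e.g. "nvidia.com/gpu": [0, 1]) is then set in the
# configuration yaml under the runner's resources, not in service.py.
runner = bentoml.Runner(CustomRunner, name="custom_runner")
svc = bentoml.Service("my_service", runners=[runner])
```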
So basically, if in my yaml config I say I want an instance of this runner on GPU 0 and another on GPU 1, how will `.to(self.torch_device)` (whose value is just `"cuda"`) know which GPU to move the tensors to?
With the `transformers` pipeline setup, the pipeline has to be moved to a specific GPU device `cuda:x` rather than simply `cuda`, so the same setup will just fail, since I won't be able to move the pipeline to the GPU devices specified in the yaml config.
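For comparison, the pipeline-based version would need something like this, with the device pinned at construction time (the task and model name are just placeholders):
```python
import torch
from transformers import pipeline

# The pipeline API wants an explicit device up front: device=0 maps to
# "cuda:0", device=-1 means CPU. This is exactly the id I can't know
# inside service.py when the GPU assignment lives in the yaml config.
summarizer = pipeline(
    "summarization",
    model="t5-small",
    device=0 if torch.cuda.is_available() else -1,
)
```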