# ask-for-help
b
I think we need a better way to “preheat” the endpoint with a sample request. Would love to hear @Sean's opinion
Btw, would love to have your feedback on the Triton draft PR: https://github.com/bentoml/BentoML/pull/3471
🙌 2
👀 1
s
A good practice is to rely on the health check endpoint to determine whether the API and runner servers are ready, and only send requests once the health check returns 200.
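A minimal sketch of that pattern, assuming a BentoML HTTP server on localhost:3000 exposing the default /readyz endpoint (both the address and the endpoint path are assumptions; adjust for your deployment):

```python
import time

import requests

BASE_URL = "http://localhost:3000"  # assumption: local dev server on BentoML's default port


def wait_until_ready(timeout_s: float = 120.0, interval_s: float = 1.0) -> None:
    """Poll the readiness endpoint until it returns 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{BASE_URL}/readyz", timeout=5).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    raise TimeoutError("service did not become ready in time")


if __name__ == "__main__":
    wait_until_ready()
    print("service is ready; safe to send traffic")
```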
b
is it possible to override the /ready endpoint with a user-specified one in the Bento service?
j
Yup, this is exactly what I ended up doing. I needed a custom path for my health check, so in that path I just added a request to the /healthz endpoint. Even with that, it seems to be slow on the first real request to my /classify endpoint though. Would /readyz be better?
b
how many runners do you have?
i realized that if I have N runners I seem to need N requests to get it all warmed up (makes sense though)
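If that turns out to be the case, one crude warm-up approach is to fire one representative request per runner before taking real traffic. A sketch, assuming the /classify endpoint mentioned above accepts a JSON payload (the payload shape and runner count are placeholders):

```python
import requests

BASE_URL = "http://localhost:3000"     # assumption: local dev server
NUM_RUNNERS = 2                        # one warm-up request per runner, per the observation above
SAMPLE_PAYLOAD = {"text": "warm-up"}   # placeholder; use a realistic payload for your model


def preheat() -> None:
    """Send one representative request per runner so every worker loads its model."""
    for i in range(NUM_RUNNERS):
        resp = requests.post(f"{BASE_URL}/classify", json=SAMPLE_PAYLOAD, timeout=60)
        print(f"warm-up request {i + 1}/{NUM_RUNNERS}: HTTP {resp.status_code}")


if __name__ == "__main__":
    preheat()
```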
j
2 of them. I’ll have to test that out. I could try adding a second request to the health check and see if it makes the first real request quicker
b
awesome! tell me if it works ❤️
c
It could also be related to some ML frameworks' internal lazy-loading behavior. Which ML framework are you using with BentoML?
j
Right now I'm using Transformers, just pre-trained models from HuggingFace
I created a couple of custom transformers pipelines (to do some extra pre- and post-processing) around their BEiT and CLIP models
c
Could you share related service definition code and model saving code?
j
Sure. I sanitized a few proprietary things, but here's a Gist with the 2 scripts to save the Bento models and the service.py file: https://gist.github.com/akuma12/b4443a0103b2dc8b661b1bdb6d61e6ee
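(For readers without access to the Gist, a generic sketch of what saving a Hugging Face pipeline and wiring it into a service can look like with bentoml.transformers. The model, task, endpoint, and payload shape below are illustrative assumptions, not the code from the Gist.)

```python
# save_model.py -- save a pre-trained Hugging Face pipeline as a Bento model
import bentoml
from transformers import pipeline

clip = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",  # placeholder model
)
bentoml.transformers.save_model("clip_classifier", clip)
```

```python
# service.py -- expose the saved model through a runner
import bentoml
from bentoml.io import JSON

clip_runner = bentoml.transformers.get("clip_classifier:latest").to_runner()

svc = bentoml.Service("image_classifier", runners=[clip_runner])


@svc.api(input=JSON(), output=JSON())
async def classify(payload: dict) -> dict:
    # payload shape is illustrative; the runner forwards args to the pipeline's __call__
    result = await clip_runner.async_run(
        payload["image_url"], candidate_labels=payload["labels"]
    )
    return {"result": result}
```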
Semi-unrelated question, but is it possible to run multiple runner processes without using Yatai? Like run 2 copies of each of my runners. Not sure if it would be a performance gain or not, but I have plenty of GPU memory to spare
b
If you deploy a BentoDeployment, you can specify any number of runners you want
i.e. write a BentoDeployment YAML and kubectl create -f mybentodeployment.yaml
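For reference, a rough sketch of such a manifest. The apiVersion and exact field names depend on the Yatai release, so treat them as assumptions and check the BentoDeployment CRD docs for your version; the bento tag and runner name are placeholders:

```yaml
# mybentodeployment.yaml
apiVersion: serving.yatai.ai/v2alpha1   # assumption: may differ by Yatai version
kind: BentoDeployment
metadata:
  name: my-bento-deployment
spec:
  bento: image_classifier:latest        # placeholder bento tag
  autoscaling:                          # replica range for the API server pods
    minReplicas: 1
    maxReplicas: 1
  runners:
    - name: clip_classifier             # placeholder runner name
      autoscaling:                      # replica range for this runner's pods
        minReplicas: 2
        maxReplicas: 2
```

Applied with kubectl create -f mybentodeployment.yaml as described above.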
c
Yes, it is possible to configure the number of replicas for each runner within a single container (no Kubernetes or Yatai needed)
j
I don't suppose you can elaborate on how to do that? I see the autoscaling section of the Yatai BentoDeployment, but that doesn't seem to work inside of the bentoml_configuration.yaml file.
I should mention that the instances I'm running on have a single GPU.
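As a closing note on the question above: newer BentoML releases appear to expose a workers_per_resource option under the runner configuration, which would let a single container run multiple worker processes per runner on one GPU. This is an assumption about the configuration schema, so verify it against the configuration docs for your BentoML version; the runner name is a placeholder:

```yaml
# bentoml_configuration.yaml (sketch; keys assumed, verify for your BentoML version)
runners:
  clip_classifier:              # placeholder runner name
    resources:
      nvidia.com/gpu: 1         # the single GPU mentioned above
    workers_per_resource: 2     # assumed key: run two worker processes on that GPU
```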