# ask-for-help
c
Yeah, I think BentoML could help in this scenario. Are those 5 models just serving different types of traffic? Or are you building some sort of inference graph, say running those 5 models sequentially?
t
Those 5 models are just 5 different sizes of custom-trained GPT models, so they are serving different kinds of traffic. The problem is that when one model is being called by a client, the other models are blocked until execution of the first model is done.
My naive way of solving this was to just spin up 5 different FastAPI apps, but I thought I'd give it some more thought…
Since in that case I would get 5 different ports, and I want the user to only need to care about one URL to an external API.
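One way to keep a single public URL in front of several per-model services is a small async gateway. A minimal sketch, assuming hypothetical backend ports and a /predict route on each per-model service:

```python
# gateway.py -- minimal sketch of a single public entry point in front of
# several per-model services. Ports and the /predict path are assumptions.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Hypothetical mapping from model name to the port of its dedicated service.
MODEL_PORTS = {"model1": 8001, "model2": 8002, "model3": 8003,
               "model4": 8004, "model5": 8005}

@app.post("/api/{model_name}")
async def proxy(model_name: str, request: Request) -> dict:
    payload = await request.json()
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            f"http://127.0.0.1:{MODEL_PORTS[model_name]}/predict", json=payload
        )
    return resp.json()
```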
y
I think this is exactly what BentoML can help achieve, and it is something I'm already doing for a couple of production models my company is hosting. In my case, since I have rather complicated preprocessing and postprocessing logic, I can't do everything within the BentoML server, so I have a Python Flask server hosted with Gunicorn plus a BentoML server hosting all the ML models I use, and the Flask server sends requests to the local BentoML model server for inference. If your pre/post-processing parts are rather simple, you probably only need to host a BentoML server and create 5 different endpoints (i.e. /api/model1, /api/model2, etc.). You will only have 1 port for your service and you can run those models asynchronously.
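A minimal sketch of that multi-endpoint idea using the BentoML 1.x Service/Runner API (the exact API differs across BentoML versions; the model tags, the pytorch framework, and the input format are all assumptions):

```python
# service.py -- one BentoML service exposing several independently trained GPT models.
# Model tags (gpt_small, gpt_medium, ...) and the pytorch framework are assumptions.
import bentoml
from bentoml.io import JSON

small_runner = bentoml.pytorch.get("gpt_small:latest").to_runner()
medium_runner = bentoml.pytorch.get("gpt_medium:latest").to_runner()

svc = bentoml.Service("gpt_models", runners=[small_runner, medium_runner])

@svc.api(input=JSON(), output=JSON(), route="/api/model1")
async def model1(payload: dict) -> dict:
    # async_run lets other endpoints keep serving while this model is busy
    result = await small_runner.async_run(payload["input"])
    return {"result": result}

@svc.api(input=JSON(), output=JSON(), route="/api/model2")
async def model2(payload: dict) -> dict:
    result = await medium_runner.async_run(payload["input"])
    return {"result": result}

# ...repeat for the remaining models, then serve with: bentoml serve service.py:svc
```

In production mode each runner can be scheduled in its own worker process, so a slow call to one model does not have to block requests to the others.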
t
Sounds very similar! But won't your Flask app with Gunicorn be blocked while some model is being used for inference on the BentoML server, until it gets a response?
y
This depends on the actual volume of requests and how much resource you let your Flask/Gunicorn app consume. For example, in our case we have 2 Gunicorn Flask servers running the Flask app, with a load balancer on top of them to distribute the requests, and 2 BentoML model servers as well, one on each server, assigned to its Flask server. At least we haven't hit that limitation yet, but it's definitely something to consider when allocating resources. It's less of an issue if the model inference time is low, but in our case, with our complex pre/post-processing logic and some early-exit rules, it's hard to embed all of that into the BentoML server.
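The Flask-in-front-of-BentoML pattern described above, reduced to a minimal sketch; the local BentoML port, the endpoint path, and the pre/post-processing hooks are placeholders:

```python
# app.py -- Flask front end that does pre/post-processing and forwards
# inference to a local BentoML model server. Port and paths are assumptions.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
BENTOML_URL = "http://127.0.0.1:3000"  # local BentoML model server (assumed port)

@app.route("/api/<model_name>", methods=["POST"])
def infer(model_name: str):
    payload = request.get_json()
    # ...custom preprocessing / early-exit rules would go here...
    resp = requests.post(f"{BENTOML_URL}/api/{model_name}", json=payload, timeout=30)
    # ...custom postprocessing would go here...
    return jsonify(resp.json())
```

Note that each Gunicorn worker is blocked for the duration of the upstream call, which is why the number of workers and the load balancer matter in this setup.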
t
Interesting! Is the load balancer set up with e.g. NGINX, or something else?
y
That I'm not sure about; it's something set up by our IT 😛. I believe there are a lot of options for doing this.
t
OK! 🙂 Do you know what happens when running Gunicorn with several workers, e.g. does a new request know which worker to pick? (Wouldn't that be the same as a load balancer?)
y
I guess you can read more about how Gunicorn works in general. Basically speaking, there's a master Gunicorn process that spawns all the workers and routes incoming requests to a random worker. There's no difference between the workers other than each having its own PID in the system.
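A minimal Gunicorn config sketch for the multi-worker setup being discussed; the values are placeholders to tune for your own hardware and app name:

```python
# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py app:app
bind = "0.0.0.0:8000"   # single public port for the Flask app
workers = 4             # master process forks this many worker processes
timeout = 120           # generous timeout to allow for slow model inference
```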
t
OK, routing to a random worker sounds like it's not considering the load of the workers 🙂 I guess that's where a load balancer can come in.
y
Well, each worker is essentially a single process, so the master process will look for an idle worker to send requests to, if one is available.
t
Thanks for sharing the knowledge 🙂