# ask-for-help
c
Yeah, I think BentoML could help in this scenario. Are those 5 models just serving different types of traffic? Or are you building some sort of inference graph, say running those 5 models sequentially?
t
Those 5 models are just 5 different sizes of custom-trained GPT models, so they are serving different kinds of traffic. The problem is that when one model is being called by a client, the other models are blocked until execution of the first model is done.
My naive way of solving this was to just spin up 5 different FastAPI apps, but I thought I'd give it some more thought…
Since in that case I would get 5 different ports, and I want the user to only need to care about one URL to an external API.
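One way to keep a single public URL in front of several per-model services is a small async gateway. A minimal sketch, assuming hypothetical backend ports and a /predict route on each per-model service:

```python
# gateway.py -- minimal sketch of a single public entry point in front of
# several per-model services. Ports and the /predict path are assumptions.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Hypothetical mapping from model name to the port of its dedicated service.
MODEL_PORTS = {"model1": 8001, "model2": 8002, "model3": 8003,
               "model4": 8004, "model5": 8005}

@app.post("/api/{model_name}")
async def proxy(model_name: str, request: Request) -> dict:
    payload = await request.json()
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            f"http://127.0.0.1:{MODEL_PORTS[model_name]}/predict", json=payload
        )
    return resp.json()
```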
y
I think this is exactly what BentoML can help achieve, and it is something I'm already doing for a couple of production models my company is hosting. In my case, since I have rather complicated preprocessing and postprocessing logic, I can't do everything within the BentoML server, so I have a Python Flask server hosted with Gunicorn plus a BentoML server hosting all the ML models I use, and the Flask server sends requests to the local BentoML model server for inference. If your pre/post-processing parts are rather simple, you probably only need to host a BentoML server and create 5 different endpoints (i.e. /api/model1, /api/model2, etc.). You will only have 1 port for your service and you can run those models asynchronously.
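A minimal sketch of that multi-endpoint idea using the BentoML 1.x Service/Runner API (the exact API differs across BentoML versions; the model tags, the pytorch framework, and the input format are all assumptions):

```python
# service.py -- one BentoML service exposing several independently trained GPT models.
# Model tags (gpt_small, gpt_medium, ...) and the pytorch framework are assumptions.
import bentoml
from bentoml.io import JSON

small_runner = bentoml.pytorch.get("gpt_small:latest").to_runner()
medium_runner = bentoml.pytorch.get("gpt_medium:latest").to_runner()

svc = bentoml.Service("gpt_models", runners=[small_runner, medium_runner])

@svc.api(input=JSON(), output=JSON(), route="/api/model1")
async def model1(payload: dict) -> dict:
    # async_run lets other endpoints keep serving while this model is busy
    result = await small_runner.async_run(payload["input"])
    return {"result": result}

@svc.api(input=JSON(), output=JSON(), route="/api/model2")
async def model2(payload: dict) -> dict:
    result = await medium_runner.async_run(payload["input"])
    return {"result": result}

# ...repeat for the remaining models, then serve with: bentoml serve service.py:svc
```

In production mode each runner can be scheduled in its own worker process, so a slow call to one model does not have to block requests to the others.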
t
Sounds very similar! But won't your Flask app with Gunicorn be blocked while some model is being used for inference on the BentoML server, until it gets a response?
y
This depends on the actual volume of requests and how much resource you let your Flask/Gunicorn app consume. For example, in our case we have 2 Gunicorn Flask servers running the Flask app, with a load balancer on top of them to distribute the requests, and 2 BentoML model servers as well, one on each server, assigned to its Flask server. At least we haven't hit that limitation yet, but it's definitely something to consider when allocating resources. It's less of an issue if the model inference time is low, but in our case, with our complex pre/post-processing logic and some early-exit rules, it's hard to embed all of that into the BentoML server.
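The Flask-in-front-of-BentoML pattern described above, reduced to a minimal sketch; the local BentoML port, the endpoint path, and the pre/post-processing hooks are placeholders:

```python
# app.py -- Flask front end that does pre/post-processing and forwards
# inference to a local BentoML model server. Port and paths are assumptions.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
BENTOML_URL = "http://127.0.0.1:3000"  # local BentoML model server (assumed port)

@app.route("/api/<model_name>", methods=["POST"])
def infer(model_name: str):
    payload = request.get_json()
    # ...custom preprocessing / early-exit rules would go here...
    resp = requests.post(f"{BENTOML_URL}/api/{model_name}", json=payload, timeout=30)
    # ...custom postprocessing would go here...
    return jsonify(resp.json())
```

Note that each Gunicorn worker is blocked for the duration of the upstream call, which is why the number of workers and the load balancer matter in this setup.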
t
Interesting! Is the load balancer set up with e.g. NGINX, or something else?
y
That I'm not sure about; it's something set up by our IT 😛. I believe there are a lot of options for doing this.
t
OK! 🙂 Do you know what happens when running Gunicorn with several workers, e.g. does a new request know which worker to pick? (Wouldn't that be the same as a load balancer?)
y
I guess you can read more about how Gunicorn works in general. Basically speaking, there's a master Gunicorn process that spawns all the workers and routes incoming requests to a random worker. There's no difference between the workers other than each having its own PID in the system.
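A minimal Gunicorn config sketch for the multi-worker setup being discussed; the values are placeholders to tune for your own hardware and app name:

```python
# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py app:app
bind = "0.0.0.0:8000"   # single public port for the Flask app
workers = 4             # master process forks this many worker processes
timeout = 120           # generous timeout to allow for slow model inference
```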
t
OK, routing to a random worker sounds like it's not considering the load of the workers 🙂 I guess that's where a load balancer can come in.
y
Well, each worker is essentially a single process, so the master process will look for an idle worker to send requests to, if one is available.
t
Thanks for sharing the knowledge 🙂