Hi
@Udijs Edijs Musajevs - I assume you are deploying the container on a single-GPU machine currently. In that case, there’s only one Runner instance, so multiple requests are essentially queued and wait for execution. As far as I know, most stable diffusion implementations today don’t support batching, so adaptive batching in BentoML won’t help here either. You have a few options to scale this:
1. If you have a big enough GPU, you can try running multiple Runner instances on the same GPU, or use multiple GPUs on the same host. Both can be configured by setting a resource scheduling strategy (rough config sketch at the end of this message). See notes here:
https://docs.bentoml.org/en/latest/guides/scheduling.html
2. If you are deploying on Kubernetes, check out Yatai
https://github.com/bentoml/yatai, which spins up Runner instances as their own microservices with auto-scaling groups. You can configure it to scale the replica count of the model runner based on traffic (rough example resource at the end of this message).
3. If you want something easier to deploy and don’t mind using a commercial platform, BentoCloud
https://www.bentoml.com/ makes it super easy to set up a serverless endpoint that scales up when traffic is high, scales to zero when there’s no traffic, and only charges for the compute you use. BentoCloud is still invite-only, let me know if you’re interested!
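
To make option 1 a bit more concrete, here’s a rough sketch of what the bentoml_configuration.yaml could look like for running multiple Runner workers per GPU or pinning a runner to specific GPUs. The runner name `stable_diffusion_runner` is just a placeholder for whatever your runner is called, and the keys (`resources`, `workers_per_resource`) are from my reading of the scheduling guide linked above, so please double-check them against the docs for the BentoML version you’re on:

```yaml
# bentoml_configuration.yaml -- rough sketch, verify keys against the scheduling guide
runners:
  stable_diffusion_runner:        # placeholder: use your actual runner name
    resources:
      nvidia.com/gpu: [0, 1]      # pin this runner to GPU 0 and GPU 1
    workers_per_resource: 2       # 2 worker processes per GPU (only if you have the VRAM for it)
```

You’d then point BentoML at it when serving, e.g. `BENTOML_CONFIG=./bentoml_configuration.yaml bentoml serve .`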
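
And for option 2, here’s roughly what a Yatai BentoDeployment resource with per-runner autoscaling looks like. I’m writing this from memory, so treat the apiVersion and field names as approximate and check the Yatai docs for the exact schema; the deployment name, bento tag, and runner name below are all placeholders:

```yaml
# Rough BentoDeployment sketch for Yatai -- verify field names against the Yatai docs
apiVersion: serving.yatai.ai/v1alpha3
kind: BentoDeployment
metadata:
  name: sd-demo                      # placeholder deployment name
spec:
  bento_tag: sd_service:latest       # placeholder bento tag
  autoscaling:                       # scaling for the API server pods
    minReplicas: 1
    maxReplicas: 3
  runners:
    - name: stable_diffusion_runner  # placeholder: your runner's name
      autoscaling:                   # runner pods scale independently, based on traffic
        minReplicas: 1
        maxReplicas: 4
      resources:
        limits:
          nvidia.com/gpu: 1          # one GPU per runner replica
```

The key point is that the runners get their own pods and their own autoscaling settings, separate from the API server.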