BentoML

Do you mean multiple copies of the same runner, or separate runners?

I have the same question as Jim. Also, could you tell us a little more about the context of this request?

What is driving you to look for best practices, and what use cases do you have in mind?

Sorry for missing your response on Slack. We currently have 8 NVIDIA GeForce RTX 3090 graphics cards, but we need to support around 50 models (which I have now turned into 50 Bento Runners). There is not enough GPU memory to load all of the models onto the cards in advance. What is the most efficient way to schedule these runners?

Currently, I am not using any scheduling strategy: once the graphics cards are fully occupied, any further requested models are served by CPU inference.

Can I manually add or remove Bento Runners on a Bento Service so that I can schedule GPU resources myself?
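
To make the question more concrete, here is a rough sketch of the kind of manual scheduling I have in mind. It is plain Python and does not use any BentoML API; the class `GpuSlotScheduler` and its `acquire` method are hypothetical names made up for illustration. It also assumes, as a simplification, that each GPU holds exactly one model at a time and that the least recently used model is the one to evict. The actual loading and unloading of models is left to the caller.

```python
from collections import OrderedDict


class GpuSlotScheduler:
    """Hypothetical helper: tracks which model sits on which GPU.

    Assumes one model per GPU and least-recently-used eviction.
    It only does the bookkeeping; it does not load or unload anything.
    """

    def __init__(self, num_gpus):
        self.free_gpus = list(range(num_gpus))
        # Maps model name -> GPU index, ordered from least to most recently used.
        self.resident = OrderedDict()

    def acquire(self, model_name):
        """Return (gpu_index, evicted_model), where evicted_model may be None."""
        if model_name in self.resident:
            # Already on a GPU: just mark it as most recently used.
            self.resident.move_to_end(model_name)
            return self.resident[model_name], None

        evicted = None
        if self.free_gpus:
            gpu = self.free_gpus.pop(0)
        else:
            # No free GPU: evict the least recently used model and reuse its GPU.
            evicted, gpu = self.resident.popitem(last=False)

        self.resident[model_name] = gpu
        return gpu, evicted


if __name__ == "__main__":
    scheduler = GpuSlotScheduler(num_gpus=2)
    for name in ["model_a", "model_b", "model_a", "model_c"]:
        gpu, evicted = scheduler.acquire(name)
        print(f"{name}: gpu={gpu}, evicted={evicted}")
```

With 2 GPUs, the last request in the loop evicts `model_b`, because `model_a` was used more recently. In my case the caller would then unload the evicted model from that GPU and load the requested one, instead of falling back to CPU.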