# ask-for-help
f
Hello, according to the docs it is possible. You may check out this section of the docs: https://docs.bentoml.org/en/latest/concepts/runner.html#reusable-runnable
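(For reference, a reusable Runnable from that docs section looks roughly like this sketch; the class name, model tag, and framework are placeholders, not the actual setup discussed here:)
```python
import bentoml

class MyModelRunnable(bentoml.Runnable):
    # Resources this runnable can be scheduled onto.
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        # Each runner worker loads its own copy of the model
        # (hypothetical model tag).
        self.model = bentoml.pytorch.load_model("my_model:latest")

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, input_tensor):
        return self.model(input_tensor)

# The same Runnable class can back multiple named runners.
runner_a = bentoml.Runner(MyModelRunnable, name="runner_a")
runner_b = bentoml.Runner(MyModelRunnable, name="runner_b")
svc = bentoml.Service("my_service", runners=[runner_a, runner_b])
```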
c
I see - but the service would still need to manually call a specific named runner for prediction (`runner.run`)? Is there a way to have BentoML handle calling multiple of these in, say, a round-robin fashion?
This comment mentions runner replica counts (https://github.com/bentoml/BentoML/issues/1474#issuecomment-1016025685), but I can't see any reference to that in the docs.
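(The calling pattern in question, a service endpoint addressing one named runner, looks like this sketch based on the docs' iris quickstart; the model tag and names are placeholders:)
```python
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Hypothetical saved model; to_runner() wraps it in a single named runner.
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(input_array: np.ndarray) -> np.ndarray:
    # The endpoint names one runner explicitly; BentoML decides which
    # worker process of that runner serves the call.
    return await iris_clf_runner.predict.async_run(input_array)
```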
f
May I ask why you want to run in a round-robin style?
c
We've been experimenting with Ray Serve vs BentoML and found that, given our test model, we needed to deploy two actors to the same GPU with Ray Serve to get the throughput we wanted. We're seeing lower throughput with BentoML, so we were wondering if there was a similar mechanism to employ.
From what I can tell, the default strategy is to deploy a runner replica per GPU (https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/runner/strategy.py)
f
Have you seen adaptive batching? https://docs.bentoml.org/en/latest/guides/batching.html
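(Per that guide, batching is tuned per runner in bentoml_configuration.yaml; a sketch, with the runner name and limits as placeholder values:)
```yaml
runners:
  iris_clf:
    batching:
      enabled: true
      max_batch_size: 100
      max_latency_ms: 500
```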
c
yes, we've tried that already
s
@Charlie Briggs, you can deploy two instances of the same runner by specifying the same GPU resource configuration. The service will automatically round-robin requests to the runner instances.
```yaml
runners:
  runner_name:
    resources:
      nvidia.com/gpu: [1, 1]
```
However, as @FIRAT TAMUR mentioned, deploying multiple instances of the same runner will hurt batching performance: requests are spread across the instances, so each one accumulates smaller batches.
c
Thanks, I wondered if something like that configuration was possible - I also assumed that yes, you'd split batches. We'll try it. Thanks!