# ask-for-help
f
Hello, according to the docs it is possible. You may check out this section of the docs: https://docs.bentoml.org/en/latest/concepts/runner.html#reusable-runnable
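(For reference, a reusable Runnable from that docs section looks roughly like this sketch; the class name, model tag, and framework are placeholders, not the actual setup discussed here:)
```python
import bentoml

class MyModelRunnable(bentoml.Runnable):
    # Resources this runnable can be scheduled onto.
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        # Each runner worker loads its own copy of the model
        # (hypothetical model tag).
        self.model = bentoml.pytorch.load_model("my_model:latest")

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, input_tensor):
        return self.model(input_tensor)

# The same Runnable class can back multiple named runners.
runner_a = bentoml.Runner(MyModelRunnable, name="runner_a")
runner_b = bentoml.Runner(MyModelRunnable, name="runner_b")
svc = bentoml.Service("my_service", runners=[runner_a, runner_b])
```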
c
I see - but the service would still need to manually call a specific named runner for prediction (`runner.run`)? Is there a way to have BentoML handle calling multiple of these in, say, a round-robin fashion?
This comment mentions runner replica counts (https://github.com/bentoml/BentoML/issues/1474#issuecomment-1016025685), but I can't see any reference to that in the docs.
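(The calling pattern in question, a service endpoint addressing one named runner, looks like this sketch based on the docs' iris quickstart; the model tag and names are placeholders:)
```python
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Hypothetical saved model; to_runner() wraps it in a single named runner.
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(input_array: np.ndarray) -> np.ndarray:
    # The endpoint names one runner explicitly; BentoML decides which
    # worker process of that runner serves the call.
    return await iris_clf_runner.predict.async_run(input_array)
```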
f
May I ask why you want to run in a round-robin style?
c
We've been experimenting with Ray Serve vs BentoML and found that, given our test model, we needed to deploy two actors to the same GPU with Ray Serve to get the throughput we wanted. We're seeing lower throughput with BentoML, so we were wondering if there was a similar mechanism to employ.
From what I can tell, the default strategy is to deploy a runner replica per GPU (https://github.com/bentoml/BentoML/blob/main/src/bentoml/_internal/runner/strategy.py)
f
Have you seen adaptive batching? https://docs.bentoml.org/en/latest/guides/batching.html
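(Per that guide, batching is tuned per runner in bentoml_configuration.yaml; a sketch, with the runner name and limits as placeholder values:)
```yaml
runners:
  iris_clf:
    batching:
      enabled: true
      max_batch_size: 100
      max_latency_ms: 500
```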
c
yes, we've tried that already
s
@Charlie Briggs, you can deploy two instances of the same runner by specifying the same GPU resource configuration. The service will automatically round-robin requests to the runner instances.
```yaml
runners:
  runner_name:
    resources:
      nvidia.com/gpu: [1, 1]
```
However, as @FIRAT TAMUR mentioned, deploying multiple instances of the same runner will hurt batching performance: requests are spread across the instances, so each one accumulates smaller batches.
c
Thanks, I wondered if something like that configuration was possible - I also assumed that yes, you'd split batches. We'll try it. Thanks!