# ask-for-help
l
Could you try turning off the batching first? Just use
Copy code
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,
    signatures={
        "__call__": {
            "batchable": False,
            "batch_dim": 0,
        },
    },
)
And see if the errors are still there
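For contrast, the batch-enabled version (what you presumably had before) would be the same call with batchable switched on:
Copy code
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,
    signatures={
        "__call__": {
            "batchable": True,   # let the runner merge concurrent requests along dim 0
            "batch_dim": 0,
        },
    },
)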
a
Yea it works when turning off the batching (and also without the `--production` flag)
l
does your model support batching? What does a batch input look like? Maybe something like `[str1, str2]`?
a
Yea that’s right, you can call it with `[str1, str2]` and it will return `[prediction1, prediction2]` - it’s a pretty big performance gain if the model is given a list rather than individual strings
l
Then when you use the runner, you should provide the input in a batch format like this:
Copy code
from bentoml.io import JSON, Text

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    # wrap the single string in a list so the runner receives a batch of size 1
    return await runner.async_run([text])
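If it helps, the runner and service themselves would be wired up roughly like this (a sketch, assuming the model was saved under the name above; the service name is just illustrative):
Copy code
import bentoml

# Sketch: build a runner from the saved transformers model; the predict
# endpoint above then calls runner.async_run([text]) against it.
runner = bentoml.transformers.get("ml-serving-experiments-bento:latest").to_runner()

svc = bentoml.Service("ml-serving-experiments", runners=[runner])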
a
Oh interesting, will that build up a list of strings then?
I’ll give it a try now
It didn’t error! But it was about 3x slower 😂, not really sure what’s going on there
l
Could you try a few more requests? At the beginning of serving, the batch balancer needs some samples to adjust the balance between latency and throughput.
maybe try at least 10 requests and see what the latency is after that
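If it helps, the adaptive batcher can also be bounded in the BentoML configuration file, roughly like this (a sketch; treat the exact keys as version-dependent):
Copy code
# bentoml_configuration.yaml (sketch)
runners:
  batching:
    enabled: true
    max_batch_size: 32    # cap on how many requests get merged into one call
    max_latency_ms: 100   # cap on how long the batcher waits to fill a batch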
a
I ran a load test with hey (runs 7 queries per second from 20 workers and lasts for a minute)
Copy code
hey \
    -m POST \
    -c 20 \
    -q 7 \
    -z 1m \
    -H "Content-Type: application/json" -d '{"text": "Predict me"}' \
    -h2 "http://localhost:3000/predict"
And 90% of the responses were ~`0.47s`, vs the non-batched where 90% were ~`0.126s`
l
That's interesting. How large is your model? Do you run the model on CPU or GPU?
If the model is publicly available, could you send me a link so I can do some benchmarking?
a
Oh interesting, I’m running it on my local machine at the moment but can try it on a remote GPU cluster we’re testing on
It’s not publicly available, but it’s trained from `distilroberta-base` (https://huggingface.co/distilroberta-base)
I can try locally with that too and see if they have similar performance
Thanks for your help btw 🙏
l
It would be great if you could try that model locally. It would help us a lot with tuning the performance.
👍 1
I will also try to use that model to do some benchmarking. It's the weekend here now, so I may get back to you next Monday if I have further results~
a
Thank you 🙏 almost weekend for us here too so no rush
🙌 1
c
Hey, I work on the same team as Andrew and have a follow-up question.
We’ve been experimenting with Bento along with other model servers, and have found that adding `batch_size` to the `transformers.Pipeline` inference call resulted in a significant performance increase for our test model (~2x inference throughput when using a value over 8).
We can’t see that being used anywhere by Bento - is that something we’re missing, or is it the case that Bento doesn’t yet use this optimisation?
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipeline-batching
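For reference, the kind of call we mean is roughly this (a sketch using a public checkpoint rather than our model; `batch_size` is the pipeline-batching knob from the docs linked above):
Copy code
from transformers import pipeline

# Sketch: classification pipeline on a public checkpoint
pipe = pipeline("text-classification", model="distilroberta-base")

texts = ["Predict me"] * 64

# batch_size makes the pipeline run the forward pass in chunks of 16
# instead of one example at a time, which is where the throughput gain comes from
predictions = pipe(texts, batch_size=16)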
l
Hi Andrew and Charlie, here's my benchmark result on a single GPU using `hey` and `distilroberta-base`:
Copy code
no --production latency    
Latency distribution:
  10% in 0.0316 secs
  25% in 0.0535 secs
  50% in 0.0860 secs
  75% in 0.1223 secs
  90% in 0.1490 secs
  95% in 0.1565 secs
  99% in 0.3045 secs  


--production latency with batching disabled
Latency distribution:
  10% in 0.0243 secs
  25% in 0.0450 secs
  50% in 0.0735 secs
  75% in 0.1001 secs
  90% in 0.1276 secs
  95% in 0.1389 secs
  99% in 0.6573 secs


--production latency with batching enabled
Latency distribution:
  10% in 0.0478 secs
  25% in 0.0744 secs
  50% in 0.0939 secs
  75% in 0.1199 secs
  90% in 0.1362 secs
  95% in 0.1489 secs
  99% in 0.1700 secs
I wonder if you have any further benchmarks on a GPU cluster?
a
Hey! Sorry for the late reply - we ended up using a custom runner on the GPU cluster and got some better performance (we can supply the batch size parameter plus some other configuration). The runner looks like this:
Copy code
from typing import List

import bentoml


class SentimentRunner(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self, config: config.Config):
        self.model = sentiment.create_model(config)

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs: List[str]):
        for i in inputs:
            sentiment.validate_input(i)

        predictions = self.model(
            inputs,
            batch_size=len(inputs),
            top_k=None,
        )

        return [sentiment.process_result(p) for p in predictions]
(`top_k=None` lets us return all the labels + scores for each prediction.)
We found this was also the best configuration for requests/s (we ran this on a g4dn.xlarge AWS instance):
Copy code
api_server:
  workers: 3
runners:
  resources:
    nvidia.com/gpu: [0, 0, 0]
These were the results for some different configurations (we were using a gradually increasing request rate for these tests to better check when they start to fail):
Copy code
| Workers | Runners | ~Peak Requests/s | Falls Over? |
| ------- | ------- | ---------------- | ----------- |
| 4       | 4       | 380-400          | No          |
| 3       | 3       | 470-480          | No          |
| 2       | 2       | 270-330          | Yes         |
| 1       | 1       | 170-200          | Yes         |
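In case it's useful, the runnable above gets wired into a service roughly like this (a sketch; the runner/service names and the config construction are illustrative, not our exact code):
Copy code
import bentoml
from bentoml.io import JSON, Text

# Sketch: register the custom runnable as a runner; runnable_init_params is how
# the Config object reaches SentimentRunner.__init__
runner = bentoml.Runner(
    SentimentRunner,
    name="sentiment_runner",
    runnable_init_params={"config": config.Config()},
)

svc = bentoml.Service("sentiment-service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    # batch of size 1 per request; the adaptive batcher merges them across requests
    results = await runner.predict.async_run([text])
    return results[0]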