Slackbot
11/24/2022, 10:52 AM

larme (shenyang)
11/25/2022, 8:43 AM
```python
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,
    signatures={
        "__call__": {
            "batchable": False,
            "batch_dim": 0,
        },
    },
)
```
And see if the errors are still there.
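For comparison, a minimal sketch of the same save call with the `__call__` signature marked batchable (same assumed `pipeline` object; only the signature flags differ), since enabling adaptive batching is what the rest of the thread explores:

```python
import bentoml

# Sketch: identical save call, but __call__ is marked batchable so BentoML's
# adaptive batching can group incoming requests along dimension 0.
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,  # assumed: the same transformers pipeline as above
    signatures={
        "__call__": {
            "batchable": True,
            "batch_dim": 0,
        },
    },
)
```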
Andrew MacMurray
11/25/2022, 10:04 AM
--production flag)
larme (shenyang)
11/25/2022, 11:00 AM
[str1, str2]?
Andrew MacMurray
11/25/2022, 11:06 AM
The model accepts [str1, str2] and will return [prediction1, prediction2] - it's a pretty big performance gain if the model is given a list rather than individual strings
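Roughly what is being described, sketched against a generic transformers pipeline (the `pipeline` object and the input strings are illustrative, not code from the thread):

```python
# One call per string: a separate forward pass for each input.
singles = [pipeline(text) for text in ["str1", "str2"]]

# One call with a list: the pipeline returns [prediction1, prediction2],
# and the model can process the inputs together.
batched = pipeline(["str1", "str2"])
```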
larme (shenyang)
11/25/2022, 11:08 AM
```python
@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    return await runner.async_run([text])
```
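For context, a sketch of how the `runner` and `svc` referenced in that snippet are typically created in BentoML 1.x (the bento name is taken from earlier in the thread; the service name is an assumption):

```python
import bentoml
from bentoml.io import JSON, Text

# Load the saved transformers model and expose it as a runner.
model_ref = bentoml.transformers.get("ml-serving-experiments-bento:latest")
runner = model_ref.to_runner()

# The service owns the runner; @svc.api endpoints can then call runner.async_run.
svc = bentoml.Service("ml-serving-experiments", runners=[runner])
```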
Andrew MacMurray
11/25/2022, 11:09 AM

Andrew MacMurray
11/25/2022, 11:09 AM

Andrew MacMurray
11/25/2022, 11:18 AM

larme (shenyang)
11/25/2022, 11:19 AM

larme (shenyang)
11/25/2022, 11:20 AM

Andrew MacMurray
11/25/2022, 11:27 AM
```
hey \
  -m POST \
  -c 20 \
  -q 7 \
  -z 1m \
  -H "Content-Type: application/json" -D "{ 'text': 'Predict me' }" \
  -h2 "http://localhost:3000/predict"
```
And 90% of the responses were ~`0.47s` vs the non-batched where 90% were ~`0.126s`
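If the extra latency comes from requests waiting to be grouped into a batch, the runner batching behaviour can be tuned in BentoML's configuration file; a minimal sketch with illustrative values (not settings used in this thread):

```yaml
# bentoml_configuration.yaml (values are illustrative)
runners:
  batching:
    enabled: true
    max_batch_size: 32    # cap on how many requests are grouped into one batch
    max_latency_ms: 100   # latency budget the adaptive batching scheduler works against
```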
larme (shenyang)
11/25/2022, 11:31 AM

larme (shenyang)
11/25/2022, 11:31 AM

Andrew MacMurray
11/25/2022, 11:32 AM

Andrew MacMurray
11/25/2022, 11:35 AM
distilroberta-base https://huggingface.co/distilroberta-base

Andrew MacMurray
11/25/2022, 11:35 AM

Andrew MacMurray
11/25/2022, 11:35 AM

larme (shenyang)
11/25/2022, 11:36 AM

larme (shenyang)
11/25/2022, 11:36 AM

Andrew MacMurray
11/25/2022, 11:37 AM

Charlie Briggs
11/28/2022, 10:26 AM
We found that passing `batch_size` to the `transformers.Pipeline` inference call resulted in a significant performance increase for our test model (~2x inference throughput when using a value over 8).
We can't see that being used anywhere by Bento - is that something we're missing, or is it the case that Bento doesn't yet use this optimisation?
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipeline-batching
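What the linked docs describe, as a minimal sketch (the `pipeline` object, texts, and the value 8 are illustrative):

```python
# Per the Hugging Face pipeline batching docs: passing batch_size makes the
# pipeline group the inputs into fixed-size batches for the forward pass.
texts = ["text one", "text two", "text three"]
predictions = pipeline(texts, batch_size=8)
```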
larme (shenyang)
11/30/2022, 3:39 AM
Some benchmark numbers with `hey` and `distilroberta-base`:
no --production latency:
```
Latency distribution:
  10% in 0.0316 secs
  25% in 0.0535 secs
  50% in 0.0860 secs
  75% in 0.1223 secs
  90% in 0.1490 secs
  95% in 0.1565 secs
  99% in 0.3045 secs
```
--production latency with batching disabled:
```
Latency distribution:
  10% in 0.0243 secs
  25% in 0.0450 secs
  50% in 0.0735 secs
  75% in 0.1001 secs
  90% in 0.1276 secs
  95% in 0.1389 secs
  99% in 0.6573 secs
```
--production latency with batching enabled:
```
Latency distribution:
  10% in 0.0478 secs
  25% in 0.0744 secs
  50% in 0.0939 secs
  75% in 0.1199 secs
  90% in 0.1362 secs
  95% in 0.1489 secs
  99% in 0.1700 secs
```
larme (shenyang)
11/30/2022, 3:40 AM

Andrew MacMurray
12/05/2022, 9:18 AM
```python
class SentimentRunner(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self, config: config.Config):
        self.model = sentiment.create_model(config)

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs: List[str]):
        for i in inputs:
            sentiment.validate_input(i)
        predictions = self.model(
            inputs,
            batch_size=len(inputs),
            top_k=None,
        )
        return [sentiment.process_result(p) for p in predictions]
```
(top_k=None lets us return all the labels + scores for each prediction)
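A sketch of how a custom Runnable like this is usually attached to a service in BentoML 1.x (the `cfg` object and the service/runner names are assumptions, not code from the thread):

```python
import bentoml

# Wrap the Runnable in a Runner; init params are forwarded to __init__.
runner = bentoml.Runner(
    SentimentRunner,
    name="sentiment_runner",
    runnable_init_params={"config": cfg},  # assumed config object
)

svc = bentoml.Service("sentiment", runners=[runner])
```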
We found this was also the best configuration for requests/s (we ran this on a g4dn.xlarge AWS instance):
```yaml
api_server:
  workers: 3
runners:
  resources:
    nvidia.com/gpu: [0, 0, 0]
```
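A sketch of how a configuration like this is typically applied when serving (the file name and `service:svc` module path are assumptions):

```
BENTOML_CONFIG=bentoml_configuration.yaml bentoml serve service:svc --production
```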
These were the results for some different configurations (we were using a gradually increasing request rate for these tests to better check when they start to fail)
| Workers | Runners | ~Peak Requests/s | Falls Over? |
| ------- | ------- | ---------------- | ----------- |
| 4 | 4 | 380-400 | No |
| 3 | 3 | 470-480 | No |
| 2 | 2 | 270-330 | Yes |
| 1 | 1 | 170-200 | Yes |