# ask-for-help
l
Could you try turning off the batching first? Just use
Copy code
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,
    signatures={
        "__call__": {
            "batchable": False,
            "batch_dim": 0,
        },
    },
)
And see if the errors are still there
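For contrast, the batch-enabled version (what you presumably had before) would be the same call with batchable switched on:
Copy code
bentoml.transformers.save_model(
    name="ml-serving-experiments-bento",
    pipeline=pipeline,
    signatures={
        "__call__": {
            "batchable": True,   # let the runner merge concurrent requests along dim 0
            "batch_dim": 0,
        },
    },
)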
a
Yea it works when turning off the batching (and also without the `--production` flag)
l
does your model support batching? What does a batch input look like? Maybe something like `[str1, str2]`?
a
Yea that’s right, you can call it with `[str1, str2]` and it will return `[prediction1, prediction2]` - it’s a pretty big performance gain if the model is given a list rather than individual strings
l
Then when you use the runner, you should provide the input in a batch format like this:
Copy code
from bentoml.io import JSON, Text

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    # wrap the single string in a list so the runner receives a batch of size 1
    return await runner.async_run([text])
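If it helps, the runner and service themselves would be wired up roughly like this (a sketch, assuming the model was saved under the name above; the service name is just illustrative):
Copy code
import bentoml

# Sketch: build a runner from the saved transformers model; the predict
# endpoint above then calls runner.async_run([text]) against it.
runner = bentoml.transformers.get("ml-serving-experiments-bento:latest").to_runner()

svc = bentoml.Service("ml-serving-experiments", runners=[runner])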
a
Oh interesting, will that build up a list of strings then?
I’ll give it a try now
It didn’t error! But it was about 3x slower 😂, not really sure what’s going on there
l
Could you try a few more requests? At the beginning of serving, the batch balancer needs some samples to adjust the balance between latency and throughput.
maybe try at least 10 requests and see what the latency is after that
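If it helps, the adaptive batcher can also be bounded in the BentoML configuration file, roughly like this (a sketch; treat the exact keys as version-dependent):
Copy code
# bentoml_configuration.yaml (sketch)
runners:
  batching:
    enabled: true
    max_batch_size: 32    # cap on how many requests get merged into one call
    max_latency_ms: 100   # cap on how long the batcher waits to fill a batch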
a
I ran a load test with hey (runs 7 queries per second from 20 workers and lasts for a minute)
Copy code
hey \
    -m POST \
    -c 20 \
    -q 7 \
    -z 1m \
    -H "Content-Type: application/json" -d '{"text": "Predict me"}' \
    -h2 "http://localhost:3000/predict"
And 90% of the responses were ~`0.47s`, vs the non-batched where 90% were ~`0.126s`
l
That's interesting. How large is your model? Do you run the model on CPU or GPU?
If the model is publicly available, could you send me a link so I can do some benchmarking?
a
Oh interesting, I’m running it on my local machine at the moment but can try it on a remote GPU cluster we’re testing on
It’s not publicly available, but it’s trained from `distilroberta-base` (https://huggingface.co/distilroberta-base)
I can try locally with that too and see if they have similar performance
Thanks for your help btw 🙏
l
It would be great if you could try that model locally. It would help us a lot with tuning the performance.
👍 1
I will also try to use that model to do some benchmarking. It's the weekend here now, so I may get back to you next Monday if I have further results~
a
Thank you 🙏 almost weekend for us here too so no rush
🙌 1
c
Hey, I work on the same team as Andrew and have a follow-up question.
We’ve been experimenting with Bento along with other model servers, and have found that adding `batch_size` to the `transformers.Pipeline` inference call resulted in a significant performance increase for our test model (~2x inference throughput when using a value over 8).
We can’t see that being used anywhere by Bento - is that something we’re missing, or is it the case that Bento doesn’t yet use this optimisation?
https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipeline-batching
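For reference, the kind of call we mean is roughly this (a sketch using a public checkpoint rather than our model; `batch_size` is the pipeline-batching knob from the docs linked above):
Copy code
from transformers import pipeline

# Sketch: classification pipeline on a public checkpoint
pipe = pipeline("text-classification", model="distilroberta-base")

texts = ["Predict me"] * 64

# batch_size makes the pipeline run the forward pass in chunks of 16
# instead of one example at a time, which is where the throughput gain comes from
predictions = pipe(texts, batch_size=16)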
l
Hi Andrew and Charlie, here's my benchmark result on a single GPU using `hey` and `distilroberta-base`:
Copy code
no --production latency    
Latency distribution:
  10% in 0.0316 secs
  25% in 0.0535 secs
  50% in 0.0860 secs
  75% in 0.1223 secs
  90% in 0.1490 secs
  95% in 0.1565 secs
  99% in 0.3045 secs  


--production latency with batching disabled
Latency distribution:
  10% in 0.0243 secs
  25% in 0.0450 secs
  50% in 0.0735 secs
  75% in 0.1001 secs
  90% in 0.1276 secs
  95% in 0.1389 secs
  99% in 0.6573 secs


--production latency with batching enabled
Latency distribution:
  10% in 0.0478 secs
  25% in 0.0744 secs
  50% in 0.0939 secs
  75% in 0.1199 secs
  90% in 0.1362 secs
  95% in 0.1489 secs
  99% in 0.1700 secs
I wonder if you have any further benchmarks on a GPU cluster?
a
Hey! Sorry for the late reply - we ended up using a custom runner on the GPU cluster and got some better performance (we can supply the batch size parameter plus some other configuration). The runner looks like this:
Copy code
from typing import List

import bentoml


class SentimentRunner(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self, config: config.Config):
        self.model = sentiment.create_model(config)

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs: List[str]):
        for i in inputs:
            sentiment.validate_input(i)

        predictions = self.model(
            inputs,
            batch_size=len(inputs),
            top_k=None,
        )

        return [sentiment.process_result(p) for p in predictions]
(`top_k=None` lets us return all the labels + scores for each prediction.)
We found this was also the best configuration for requests/s (we ran this on a g4dn.xlarge AWS instance):
Copy code
api_server:
  workers: 3
runners:
  resources:
    nvidia.com/gpu: [0, 0, 0]
These were the results for some different configurations (we were using a gradually increasing request rate for these tests to better check when they start to fail):
Copy code
| Workers | Runners | ~Peak Requests/s | Falls Over? |
| ------- | ------- | ---------------- | ----------- |
| 4       | 4       | 380-400          | No          |
| 3       | 3       | 470-480          | No          |
| 2       | 2       | 270-330          | Yes         |
| 1       | 1       | 170-200          | Yes         |
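In case it's useful, the runnable above gets wired into a service roughly like this (a sketch; the runner/service names and the config construction are illustrative, not our exact code):
Copy code
import bentoml
from bentoml.io import JSON, Text

# Sketch: register the custom runnable as a runner; runnable_init_params is how
# the Config object reaches SentimentRunner.__init__
runner = bentoml.Runner(
    SentimentRunner,
    name="sentiment_runner",
    runnable_init_params={"config": config.Config()},
)

svc = bentoml.Service("sentiment-service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def predict(text: str):
    # batch of size 1 per request; the adaptive batcher merges them across requests
    results = await runner.predict.async_run([text])
    return results[0]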