# ask-for-help
l
Is the model served on GPU or CPU?
y
GPU
l
Could you use htop to inspect whether the runner's CPU usage is exhausted? You can start the BentoML production server, run htop, then press F4 and enter `bentoml_cli.worker.runner`. This will isolate the runner process.
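(For reference, a rough way to do the same check programmatically; this is a sketch that assumes psutil is installed and reuses the `bentoml_cli.worker.runner` match string from the message above.)

```python
import psutil

# Report CPU usage of every BentoML runner worker process.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "bentoml_cli.worker.runner" in cmdline:
        # cpu_percent(interval=1.0) samples usage over one second
        print(proc.info["pid"], proc.cpu_percent(interval=1.0))
```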
y
As I mentioned, I have more than enough free resources 🙂
l
Yeah, I see. I'm asking because in some situations, even though GPU utilization is low, the runner has a performance bottleneck on the CPU side. But I can see that's not the cause in your case.
y
Do you have any suggestions on how to debug it?
It happens on the inbound_call when calling infer.
l
have you tried increasing max latency?
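(For reference, a minimal sketch of where the max latency and batch size knobs live when creating a runner; the model tag and values below are placeholders.)

```python
import bentoml

# Placeholder model tag and values: max_latency_ms is the adaptive-batching
# latency budget discussed here, max_batch_size caps how many requests are
# grouped into one batch.
runner = bentoml.pytorch.get("my_model:latest").to_runner(
    max_batch_size=64,
    max_latency_ms=10_000,
)
```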
y
So a max latency of 1 second (1,000 ms) is not good enough. Should I increase the max latency but also increase the number of workers, to make sure I'm using my resources efficiently?
If it happens to me at 40 requests per second, I guess my batch size could be too big.
When I increased my max latency to 10,000 ms, it failed much less than before. When testing with 3K requests (40 rps), I got a 6.27% fail rate on one run and 1.5% on another.
What can I do to make sure my application is stable?
l
These failures most likely happen at the beginning of the service's lifetime. We use a CORK algorithm (a variation of Nagle's algorithm [0]) to determine when to release a batch, and we need to analyze historical data to optimize that algorithm's parameters in real time. At the beginning of serving we don't have enough historical data yet, so some failures will happen. Could you try 6k requests and see if the failure rate drops to around 3%? [0]: https://en.wikipedia.org/wiki/Nagle%27s_algorithm
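(For illustration, the kind of warm-up phase being suggested; a sketch only, assuming the service is reachable at http://localhost:3000 and exposes a /predict endpoint, and that `payload` is a representative request body.)

```python
import requests

def warm_up(n=1000, url="http://localhost:3000/predict", payload=None):
    # Send n requests before the real load test so the adaptive batching
    # dispatcher can collect enough latency history to tune itself.
    for _ in range(n):
        requests.post(url, json=payload)
```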
y
@larme (shenyang) Okay, I sent 11K requests first as a warm-up. Max latency is set to 10,000 ms.
• When I sent 6,482 requests at 120 requests per second, I had 2,597 failed requests (40.06% fail rate).
• When I sent 3K requests at 40 requests per second, I had 0 failed requests.
• When I sent 4,818 requests at 60 requests per second, I had 460 failed requests (~9.55% fail rate).
l
What's the result of doing a warm-up and then sending 6k requests at 40 rps? If the failure rate drops, that means the service is stable at 40 rps after some warm-up. Then we can test at 80 rps, 120 rps, etc.
y
@larme (shenyang) Edited my message
l
OK. When you test at 120 rps, could you monitor the runner's CPU usage again? Sorry, I just want to make sure the bottleneck isn't the runner's CPU usage.
y
Sure it will take 2 minutes
@larme (shenyang) It doesn't seem to pass the 50% CPU usage
l
and what's the GPU utilization?
y
~65%
l
I'm not entirely sure, but one possibility is that while the runner is doing CPU work, the GPU is waiting for input data, so nearly half of the GPU's capacity goes unused.
y
And when I increase the max latency I just have lots of requests that stay pending
l
yes, if the runner is overloaded, then pending requests will accumulate
If your GPU has enough VRAM to run multiple instances of the model, we may be able to spin up multiple runners with a custom strategy. Maybe @sauyon or @Sean can help here.
Sorry I need to be offline, it's 5am here~
y
@larme (shenyang) Really appreciate the help! Generally speaking it means that I can't fully utilize the batch functionality because the runner gets overloaded quite easily
l
It's something we will investigate. We already have some experimental patches to improve performance, and they will be merged in upcoming releases.
y
@Chaoyu
s
The bottom line is that an HTTP 429 shouldn't result in a TypeError in the runner handle; HTTP 429 should be handled by the framework. We will try to reproduce it internally.
y
What can I do in the meantime? I want to get the v1 bento into production.
s
I think the easiest thing to do in the meantime is to use single-return-value API functions; this seems like a bug in the multi-return code.
y
@sauyon Won't that mean the batching isn't used?
s
You can continue to return a list, just not a tuple of return values.
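(For illustration, a minimal sketch of the distinction; `runner` and the extra return value are hypothetical placeholders, not code from this thread.)

```python
def api_fn_single(d):
    # `runner` is a placeholder for the PyTorch runner in the service.
    # Returning a single list keeps BentoML on the single-output path.
    return runner.run(d)

def api_fn_multi(d):
    # Returning a tuple is what triggers the multiple-output path
    # discussed above.
    return runner.run(d), {"meta": "extra"}
```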
y
How would that work? How can I prevent it from returning a tuple of return values?
s
What runner are you using right now?
y
PyTorch-based
s
Is it a custom runner or the default BentoML one?
y
default
s
Hm, ok, let me see if I can write something up that should hopefully help in the short term; that might take me a little bit, though. Do your models normally return tuples?
y
My models return a list
I've even made sure I can control what my model returns by wrapping my inference function in another function and calling that one instead.
s
Oh, I think that's another bug, then---we shouldn't be using the multiple return path at all :(
l
Yeah, only a tuple should trigger the multiple-output path.
y
This is the return value:
```python
def pytorch_predict(self, d):
    # one entry per input in the batch
    return [x.pandas().xyxy for x in self(d).tolist()]
```
s
Hm, well, I'm now very stumped. Would you be able to provide the result of `type(<your runner>.<method>.run(...))`?
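(For reference, one way to run that check outside the server; a sketch only, where `pytorch_runner` and `sample_batch` are placeholders and `init_local()` is the debug-only, in-process initializer.)

```python
# Debug only: run the runner in-process instead of in a worker.
pytorch_runner.init_local()
result = pytorch_runner.run(sample_batch)
print(type(result))  # the user reports <class 'list'>
```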
y
@sauyon Of course I can. It will take a moment
@sauyon As I was thinking. The type is: <class 'list'>
But probably when it returns the 429, it alters the response and returns it with a content type of multiple_outputs.
Now what I need is to figure out how to still benefit from bento v1 while avoiding the 429 exception. I can always wrap my inference calls in try/except so they retry, but if it happens this often at a certain load, that won't solve anything; it will only pile up more requests at the dispatcher.
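(For completeness, the kind of client-side retry described above; a sketch only, and as noted it doesn't remove the bottleneck, it just resubmits on HTTP 429. The URL is a placeholder.)

```python
import time
import requests

def predict_with_retry(payload, url="http://localhost:3000/predict",
                       retries=3, backoff=0.5):
    # Retry on 429 (the dispatcher shedding load); this only masks the
    # problem and can add more pressure on an already saturated dispatcher.
    for attempt in range(retries):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(backoff * (2 ** attempt))
    return resp
```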
s
Yeah, that definitely seems to be the case...
y
Because at the end of the day, I get the 429 while I am far from using up all my resources or hitting the batch size limit I've defined on the runner.
s
Let me find which version added multiple-output support; maybe your code will work on a version from before we added that.
Or at least we can get to the point where this isn't failing in a bad way.
y
Would love to 🙂 Thanks
I am using bentoml v1.0.6
s
Ok, can you try 1.0.5 and see if that behaves any better?
y
Sure one sec
Do I need to build the bento again and save the models again with v1.0.5?
s
Nope, you should just be able to run it with 1.0.5.
y
I get the following exception when trying to load my application now: ValueError: expected API function to have arguments (files_stream, form_stream, [context]), got (files_stream, form_stream, ctx)
s
Ugh, that was a bug with multipart descriptors. Are you able to use a BentoML built from a commit?
It should be fixed as of 43c4eae6, if you can do that. Otherwise, I've written a custom strategy that might help even on the current version.
y
Well, I'd rather keep the 1.0.6 version I've been using, and I don't mind editing some of the code to make things work better 🙂
Where do you think the problem lies? I am talking about the 429 error of course
What's the custom strategy?
s
```python
from bentoml._internal.runner.strategy import Strategy
from bentoml._internal.runner.strategy import DefaultStrategy
import math


class NxStrategy(Strategy):
    # Multiplier: how many runner workers to spawn per "default" worker.
    n = 2

    @classmethod
    def get_worker_count(cls, runnable_class, resource_request):
        # Spawn n times as many workers as the default strategy would.
        return cls.n * DefaultStrategy.get_worker_count(
            runnable_class, resource_request
        )

    @classmethod
    def get_worker_env(
        cls,
        runnable_class,
        resource_request,
        worker_index,
    ):
        # Map each extra worker back onto the environment (e.g. GPU
        # assignment) of the default worker it is duplicating.
        pseudo_worker_index = math.floor(worker_index / cls.n)
        return DefaultStrategy.get_worker_env(
            runnable_class, resource_request, pseudo_worker_index
        )


# Replace <your runner> with your runner instance.
object.__setattr__(<your runner>, "scheduling_strategy", NxStrategy)
```
This is the code you can put into your service; it's a bit hacky because we don't really support custom strategies at the moment, at least for default runners, but it should hopefully work.
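(For context, one way the snippet above might be wired into a service definition; a sketch, where the model tag and service name are placeholders.)

```python
import bentoml

# Placeholder model tag; substitute your own.
pytorch_runner = bentoml.pytorch.get("my_model:latest").to_runner()

# Hacky override, as described above; custom strategies aren't officially
# supported for default runners yet.
object.__setattr__(pytorch_runner, "scheduling_strategy", NxStrategy)

svc = bentoml.Service("my_service", runners=[pytorch_runner])
```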
y
And this might solve my 429 issue?
s
That should spawn `n` times the number of runner processes that would normally be spawned by our default strategy; basically, a higher `n` should hopefully speed things up, as long as you have enough memory. The 429 error is something that happens when our runner can't keep up, for one reason or another, so spawning more runner processes should help with saturating your resources and avoiding that bottleneck.
y
Because it will override the scheduling strategy that is already defined in the runner?
But won't creating another runner mean not fully using the batching functionality?
And it will load the model again in the new runner, right?
s
In theory, yes, but at the moment it appears that some pre- or post-processing code in your runner process is preventing it from reaching full utilization; my guess is there's some relatively heavy Python that's slowing down the runner and preventing it from using all available resources.
And yes, it would be loading the model into memory again.
y
When you say heavy pre- or post-processing code in my runner process prevents reaching full utilization, do you mean in my model's inference or in my API endpoint?
s
Inside the model's inference; the API server should be spawning enough processes for that not to be a problem