# ask-for-help
l
Is the model served on GPU or CPU?
y
GPU
l
Could you use htop to inspect whether the runner's CPU usage is exhausted? You can start the BentoML production server, run htop, then press F4 and enter `bentoml_cli.worker.runner`. This will isolate the runner process.
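(For reference, a rough way to do the same check programmatically; this is a sketch that assumes psutil is installed and reuses the `bentoml_cli.worker.runner` match string from the message above.)

```python
import psutil

# Report CPU usage of every BentoML runner worker process.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "bentoml_cli.worker.runner" in cmdline:
        # cpu_percent(interval=1.0) samples usage over one second
        print(proc.info["pid"], proc.cpu_percent(interval=1.0))
```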
y
As I mentioned, I have more than enough free resources 🙂
l
Yeah, I see. I'm asking because in some situations, even though GPU utilization is low, the runner has a performance bottleneck on the CPU side. But I can see that's not the cause in your case.
y
Do you have any suggestions on how to debug it?
It happens on the inbound_call when calling infer.
l
have you tried increasing max latency?
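(For reference, a minimal sketch of where the max latency and batch size knobs live when creating a runner; the model tag and values below are placeholders.)

```python
import bentoml

# Placeholder model tag and values: max_latency_ms is the adaptive-batching
# latency budget discussed here, max_batch_size caps how many requests are
# grouped into one batch.
runner = bentoml.pytorch.get("my_model:latest").to_runner(
    max_batch_size=64,
    max_latency_ms=10_000,
)
```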
y
So a max latency of 1 second (1,000 ms) is not good enough. Should I increase the max latency but also increase the number of workers, to make sure I'm using my resources efficiently?
If it happens to me at 40 requests per second, I guess my batch size could be too big.
When I increased my max latency to 10,000 ms, it failed much less than before. When testing with 3K requests (40 rps), I got a 6.27% fail rate on one run and 1.5% on another.
What can I do to make sure my application is stable?
l
These failures most likely happen at the beginning of the service's lifetime. We use a CORK algorithm (a variation of Nagle's algorithm [0]) to determine when to release a batch, and we need to analyze historical data to optimize that algorithm's parameters in real time. At the beginning of serving we don't have enough historical data yet, so some failures will happen. Could you try 6k requests and see if the failure rate drops to around 3%? [0]: https://en.wikipedia.org/wiki/Nagle%27s_algorithm
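(For illustration, the kind of warm-up phase being suggested; a sketch only, assuming the service is reachable at http://localhost:3000 and exposes a /predict endpoint, and that `payload` is a representative request body.)

```python
import requests

def warm_up(n=1000, url="http://localhost:3000/predict", payload=None):
    # Send n requests before the real load test so the adaptive batching
    # dispatcher can collect enough latency history to tune itself.
    for _ in range(n):
        requests.post(url, json=payload)
```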
y
@larme (shenyang) Okay, I sent 11K requests first as a warm-up. Max latency is set to 10,000 ms.
• When I sent 6,482 requests at 120 requests per second, I had 2,597 failed requests (40.06% fail rate).
• When I sent 3K requests at 40 requests per second, I had 0 failed requests.
• When I sent 4,818 requests at 60 requests per second, I had 460 failed requests (~9.55% fail rate).
l
What's the result of doing a warm-up and then sending 6k requests at 40 rps? If the failure rate drops, that means the service is stable at 40 rps after some warm-up. Then we can test at 80 rps, 120 rps, etc.
y
@larme (shenyang) Edited my message
l
OK. When you test at 120 rps, could you monitor the runner's CPU usage again? Sorry, I just want to make sure the bottleneck isn't the runner's CPU usage.
y
Sure it will take 2 minutes
@larme (shenyang) It doesn't seem to pass the 50% CPU usage
l
and what's the GPU utilization?
y
~65%
l
I'm not entirely sure, but one possibility is that while the runner is doing CPU work, the GPU is waiting for input data, so nearly half of the GPU's capacity goes unused.
y
And when I increase the max latency I just have lots of requests that stay pending
l
yes, if the runner is overloaded, then pending requests will accumulate
If your GPU has enough VRAM to run multiple instances of the model, we may be able to spin up multiple runners with a custom strategy. Maybe @sauyon or @Sean can help here.
Sorry I need to be offline, it's 5am here~
y
@larme (shenyang) Really appreciate the help! Generally speaking it means that I can't fully utilize the batch functionality because the runner gets overloaded quite easily
l
It's something we will investigate. We already have some experimental patches to improve performance, and they will be merged in upcoming releases.
y
@Chaoyu
s
The bottom line is that an HTTP 429 shouldn't result in a TypeError in the runner handle; HTTP 429 should be handled by the framework. We will try to reproduce it internally.
y
What can I do in the meantime? I want to get the v1 bento into production.
s
I think the easiest thing to do in the meantime is to use single-return-value API functions; this seems like a bug in the multi-return code.
y
@sauyon Won't that mean the batching isn't used?
s
You can continue to return a list, just not a tuple of return values.
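(For illustration, a minimal sketch of the distinction; `runner` and the extra return value are hypothetical placeholders, not code from this thread.)

```python
def api_fn_single(d):
    # `runner` is a placeholder for the PyTorch runner in the service.
    # Returning a single list keeps BentoML on the single-output path.
    return runner.run(d)

def api_fn_multi(d):
    # Returning a tuple is what triggers the multiple-output path
    # discussed above.
    return runner.run(d), {"meta": "extra"}
```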
y
How would that work? How can I prevent it from returning a tuple of return values?
s
What runner are you using right now?
y
PyTorch-based
s
Is it a custom runner or the default BentoML one?
y
default
s
Hm, ok, let me see if I can write something up that should hopefully help in the short term; that might take me a little bit, though. Do your models normally return tuples?
y
My models return a list
I've even made sure I can control what my model returns by wrapping my inference function in another function and calling that one instead.
s
Oh, I think that's another bug, then---we shouldn't be using the multiple return path at all :(
l
Yeah, only a tuple should trigger the multiple-output path.
y
This is the return value:
```python
def pytorch_predict(self, d):
    # one entry per input in the batch
    return [x.pandas().xyxy for x in self(d).tolist()]
```
s
Hm, well, I'm now very stumped. Would you be able to provide the result of `type(<your runner>.<method>.run(...))`?
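(For reference, one way to run that check outside the server; a sketch only, where `pytorch_runner` and `sample_batch` are placeholders and `init_local()` is the debug-only, in-process initializer.)

```python
# Debug only: run the runner in-process instead of in a worker.
pytorch_runner.init_local()
result = pytorch_runner.run(sample_batch)
print(type(result))  # the user reports <class 'list'>
```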
y
@sauyon Of course I can. It will take a moment
@sauyon As I was thinking. The type is: <class 'list'>
But probably when it returns the 429, it alters the response and returns it with a content type of multiple_outputs.
Now what I need is to figure out how to still benefit from bento v1 while avoiding the 429 exception. I can always wrap my inference calls in try/except so they retry, but if it happens this often at a certain load, that won't solve anything; it will only pile up more requests at the dispatcher.
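(For completeness, the kind of client-side retry described above; a sketch only, and as noted it doesn't remove the bottleneck, it just resubmits on HTTP 429. The URL is a placeholder.)

```python
import time
import requests

def predict_with_retry(payload, url="http://localhost:3000/predict",
                       retries=3, backoff=0.5):
    # Retry on 429 (the dispatcher shedding load); this only masks the
    # problem and can add more pressure on an already saturated dispatcher.
    for attempt in range(retries):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        time.sleep(backoff * (2 ** attempt))
    return resp
```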
s
Yeah, that definitely seems to be the case...
y
Because at the end of the day, I get the 429 while I am far from using up all my resources or hitting the batch size limit I've defined on the runner.
s
Let me find which version added multiple-output support; maybe your code will work on a version from before we added that.
Or at least we can get to the point where this isn't failing in a bad way.
y
Would love to 🙂 Thanks
I am using bentoml v1.0.6
s
Ok, can you try 1.0.5 and see if that behaves any better?
y
Sure one sec
Do I need to build the bento again and save the models again with v1.0.5?
s
Nope, you should just be able to run it with 1.0.5.
y
I get the following exception when trying to load my application now: ValueError: expected API function to have arguments (files_stream, form_stream, [context]), got (files_stream, form_stream, ctx)
s
Ugh, that was a bug with multipart descriptors. Are you able to use a BentoML built from a commit?
It should be fixed as of 43c4eae6, if you can do that. Otherwise, I've written a custom strategy that might help even on the current version.
y
Well, I'd rather keep the 1.0.6 version I've been using, and I don't mind editing some of the code to make things work better 🙂
Where do you think the problem lies? I am talking about the 429 error of course
What's the custom strategy?
s
```python
from bentoml._internal.runner.strategy import Strategy
from bentoml._internal.runner.strategy import DefaultStrategy
import math


class NxStrategy(Strategy):
    # Multiplier: how many runner workers to spawn per "default" worker.
    n = 2

    @classmethod
    def get_worker_count(cls, runnable_class, resource_request):
        # Spawn n times as many workers as the default strategy would.
        return cls.n * DefaultStrategy.get_worker_count(
            runnable_class, resource_request
        )

    @classmethod
    def get_worker_env(
        cls,
        runnable_class,
        resource_request,
        worker_index,
    ):
        # Map each extra worker back onto the environment (e.g. GPU
        # assignment) of the default worker it is duplicating.
        pseudo_worker_index = math.floor(worker_index / cls.n)
        return DefaultStrategy.get_worker_env(
            runnable_class, resource_request, pseudo_worker_index
        )


# Replace <your runner> with your runner instance.
object.__setattr__(<your runner>, "scheduling_strategy", NxStrategy)
```
This is the code you can put into your service; it's a bit hacky because we don't really support custom strategies at the moment, at least for default runners, but it should hopefully work.
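(For context, one way the snippet above might be wired into a service definition; a sketch, where the model tag and service name are placeholders.)

```python
import bentoml

# Placeholder model tag; substitute your own.
pytorch_runner = bentoml.pytorch.get("my_model:latest").to_runner()

# Hacky override, as described above; custom strategies aren't officially
# supported for default runners yet.
object.__setattr__(pytorch_runner, "scheduling_strategy", NxStrategy)

svc = bentoml.Service("my_service", runners=[pytorch_runner])
```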
y
And this might solve my 429 issue?
s
That should spawn `n` times the number of runner processes that would normally be spawned by our default strategy; basically, a higher `n` should hopefully speed things up, as long as you have enough memory. The 429 error is something that happens when our runner can't keep up, for one reason or another, so spawning more runner processes should help with saturating your resources and avoiding that bottleneck.
y
Because it will override the scheduling strategy that is already defined in the runner?
But won't creating another runner mean not fully using the batching functionality?
And it will load the model again in the new runner, right?
s
In theory, yes, but at the moment it appears that some pre- or post-processing code in your runner process is preventing it from reaching full utilization; my guess is there's some relatively heavy Python that's slowing down the runner and preventing it from using all available resources.
And yes, it would be loading the model into memory again.
y
When you say heavy pre- or post-processing code in my runner process prevents reaching full utilization, do you mean in my model's inference or in my API endpoint?
s
Inside the model's inference; the API server should be spawning enough processes for that not to be a problem