# ask-for-help
b
Cc @sauyon
s
Hello! Do you have adaptive batching enabled?
j
I tried using adaptive batching, but I was getting a lot of timeouts on my runners unless I set the max_batch_ms really high, and even then my overall latency skyrocketed
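For reference, a minimal sketch of the knobs in play, assuming BentoML 1.x where runner batching is configured in `bentoml_configuration.yaml` (the latency cap is spelled `max_latency_ms` there; the values below are placeholders):
```yaml
# bentoml_configuration.yaml -- illustrative placeholder values
runners:
  batching:
    enabled: true        # turn adaptive batching on for runners
    max_batch_size: 32   # upper bound on how many requests get grouped together
    max_latency_ms: 300  # how long a request may wait for a batch to fill before dispatch
```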
s
Right, so this is with batching disabled? I think our latency story has a lot of kinks that need to be ironed out, but I'm surprised that it gets so high. My initial suspicion would be that requests are being served "out of order", since without batching we have no scheduling at all.
j
Right, no batching. I'm running on single GPU instances. I tried it with a single copy of each runner on the one GPU, and also with up to 4 copies each. More than 2 copies each caused it to slow down really badly as well. There wasn't a huge difference between 1 copy and 2 copies.
I tried adjusting the CPU resources for the runners as well, but it didn't seem to make a huge difference.
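In case it helps to compare configs, this is roughly where those per-runner knobs live, assuming BentoML 1.x; the runner name and numbers here are placeholders:
```yaml
# bentoml_configuration.yaml -- placeholder runner name and values
runners:
  my_model:              # per-runner override of the top-level "runners" defaults
    resources:
      cpu: 2             # CPU cores allotted to this runner's workers
      nvidia.com/gpu: 1  # GPUs this runner may use
    timeout: 60          # seconds before a call to this runner times out
```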
It works really well under a controlled load test with a single image that's not very big.
Once I start sending production traffic to it, though, that's when the p95 latency goes up considerably.
At full load I see around 30-60% GPU utilization, and each model takes up around 1500 MB of GPU memory.
s
What's the CPU load like? Are any cores pinned at 100% when you see very high latency?
j
yeah actually, 1 or 2 would typically be pegged or close to pegged
the machines I'm running on are 4 core
s
Right, it sounds like the issue is probably that API server workers are getting overloaded then.
j
That's what it seemed like...like they were falling behind
s
It might be worth just brute forcing it and spawning more API workers manually using `--api-workers n` and seeing what happens?
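If it's easier, the same thing can be pinned in config instead of on the command line, assuming the BentoML 1.x `api_server.workers` key (8 is just an example value):
```yaml
# bentoml_configuration.yaml -- example value only
api_server:
  workers: 8  # number of API server worker processes serving HTTP requests
```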
j
I could give it a shot. I think I ran as many as 7 workers at one point, but I don't remember the results
a
Related question: does BentoML (especially the containers created with it) auto-optimize the computation based on hardware? For example, if I run the container on CPU (Intel vs AMD vs something else) vs GPU (NVIDIA vs something else) vs TPU, or AWS Inferentia or AWS Trainium? Is that a build-time decision or a runtime one?
j
If you specify a CUDA version in your bentofile.yaml, it will use an NVIDIA CUDA base image, but if you launch it on a machine with no GPU, it will detect that and fall back to CPU. For other hardware I believe you would have to use a custom image. For Inferentia, for example, you'd have to compile your model with the Neuron SDK.
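Concretely, that looks something like this in a bentofile.yaml (the service path, packages, and CUDA version are placeholders):
```yaml
# bentofile.yaml -- placeholder service path, packages, and CUDA version
service: "service:svc"    # import path of your bentoml.Service instance
include:
  - "*.py"
python:
  packages:
    - torch
docker:
  cuda_version: "11.6.2"  # selects an NVIDIA CUDA base image; omit it for a CPU-only image
```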
a
Thanks! @Bo is this something BentoML should facilitate and manage without the user having to own it? Especially compiling/optimizing the model for the target hardware. Is this doable?
b
@Amar Ramesh Kamat good question. BentoML could help facilitate that. I think the complexity and the edge cases are what's holding us back right now.