# ask-for-help
b
Cc @sauyon
s
Hello! Do you have adaptive batching enabled?
j
I tried using adaptive batching, but I was getting a lot of timeouts on my runners unless I set the max_batch_ms really high, and even then my overall latency skyrocketed
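For reference, a minimal sketch of the knobs in play, assuming BentoML 1.x where runner batching is configured in `bentoml_configuration.yaml` (the latency cap is spelled `max_latency_ms` there; the values below are placeholders):
```yaml
# bentoml_configuration.yaml -- illustrative placeholder values
runners:
  batching:
    enabled: true        # turn adaptive batching on for runners
    max_batch_size: 32   # upper bound on how many requests get grouped together
    max_latency_ms: 300  # how long a request may wait for a batch to fill before dispatch
```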
s
Right, so this is with batching disabled? I think our latency story has a lot of kinks that need to be ironed out, but I'm surprised that it gets so high. My initial suspicion would be that requests are being served "out of order", since without batching we have no scheduling at all.
j
Right, no batching. I'm running on single GPU instances. I tried it with a single copy of each runner on the one GPU, and also with up to 4 copies each. More than 2 copies each caused it to slow down really badly as well. There wasn't a huge difference between 1 copy and 2 copies.
I tried adjusting the CPU resources for the runners as well, but it didn't seem to make a huge difference.
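In case it helps to compare configs, this is roughly where those per-runner knobs live, assuming BentoML 1.x; the runner name and numbers here are placeholders:
```yaml
# bentoml_configuration.yaml -- placeholder runner name and values
runners:
  my_model:              # per-runner override of the top-level "runners" defaults
    resources:
      cpu: 2             # CPU cores allotted to this runner's workers
      nvidia.com/gpu: 1  # GPUs this runner may use
    timeout: 60          # seconds before a call to this runner times out
```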
It works really well under a controlled load test with a single image that's not very big.
Once I start sending production traffic to it, though, that's when the p95 latency goes up considerably.
At full load I see around 30-60% GPU utilization, and each model takes up around 1500 MB of GPU memory.
s
What's the CPU load like? Are any cores pinned at 100% when you see very high latency?
j
yeah actually, 1 or 2 would typically be pegged or close to pegged
the machines I'm running on are 4 core
s
Right, it sounds like the issue is probably that API server workers are getting overloaded then.
j
That's what it seemed like...like they were falling behind
s
It might be worth just brute forcing it and spawning more API workers manually using `--api-workers n` and seeing what happens?
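If it's easier, the same thing can be pinned in config instead of on the command line, assuming the BentoML 1.x `api_server.workers` key (8 is just an example value):
```yaml
# bentoml_configuration.yaml -- example value only
api_server:
  workers: 8  # number of API server worker processes serving HTTP requests
```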
j
I could give it a shot. I think I ran as many as 7 workers at one point, but I don't remember the results
a
Related question: does BentoML (especially the containers created with it) auto-optimize the computation based on hardware? For example, if I run the container on CPU (Intel vs AMD vs something else) vs GPU (NVIDIA vs something else) vs TPU, or AWS Inferentia or AWS Trainium? Is that a build-time decision or a runtime one?
j
If you specify a CUDA version in your bentofile.yaml, it will use an NVIDIA CUDA base image, but if you launch it on a machine with no GPU, it will detect that and fall back to CPU. For other hardware I believe you would have to use a custom image. For Inferentia, for example, you'd have to compile your model with the Neuron SDK.
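Concretely, that looks something like this in a bentofile.yaml (the service path, packages, and CUDA version are placeholders):
```yaml
# bentofile.yaml -- placeholder service path, packages, and CUDA version
service: "service:svc"    # import path of your bentoml.Service instance
include:
  - "*.py"
python:
  packages:
    - torch
docker:
  cuda_version: "11.6.2"  # selects an NVIDIA CUDA base image; omit it for a CPU-only image
```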
a
Thanks! @Bo is this something BentoML should facilitate and manage without the user having to own it? Especially compiling/optimizing the model for the target hardware. Is this doable?
b
@Amar Ramesh Kamat good question. BentoML could help facilitate that. I think the complexity and the edge cases are what's holding us back right now.