# ask-for-help
b
Could it be:
```python
SUPPORTED_RESOURCES = ("nvidia.com/gpu")  # <-- this is a string
```
Should be:
```python
SUPPORTED_RESOURCES = ("nvidia.com/gpu",)  # <-- this is a tuple
```
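This is the classic Python single-element tuple gotcha: parentheses alone are just grouping, and only the trailing comma makes a tuple. A quick illustration:
```python
# Parentheses without a comma are just grouping, so this is a str:
a = ("nvidia.com/gpu")
# The trailing comma is what makes a one-element tuple:
b = ("nvidia.com/gpu",)
print(type(a), type(b))  # <class 'str'> <class 'tuple'>
```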
g
tried to make it a tuple, it didn’t work though
also, I tried serving the model natively with FastAPI today, and the latency is 50x lower while handling 4x the RPS
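For reference, the "native" FastAPI setup presumably looked something like this minimal sketch; the request schema and `run_model` are hypothetical stand-ins, since the actual code wasn't posted in the thread:
```python
# Hypothetical minimal FastAPI serving sketch; run_model stands in
# for the real inference call, which wasn't shared in the thread.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: list[float]

def run_model(inputs: list[float]) -> list[float]:
    return inputs  # placeholder for the actual model

@app.post("/predict")
async def predict(req: PredictRequest):
    return {"outputs": run_model(req.inputs)}

# e.g. uvicorn app:app --workers 4
```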
b
so if I understand you right, you're saying that the GPU runs slower?
g
no, I'm saying that the CPU gets fully utilized at a very low RPS, even without running any inference.
for some reason, the runners (with no computation in them) drain all the CPUs when I get to around 25-30 RPS
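A no-op runner like the one g describes might look like this sketch in BentoML 1.x (the names and JSON I/O are assumptions; the actual service wasn't shared). Even with an empty `predict`, every request still pays the API-server-to-runner dispatch cost, which is where the CPU would be going in this scenario:
```python
import bentoml
from bentoml.io import JSON

# Sketch of a runner that does no real work, to isolate
# dispatch overhead from actual inference cost.
class NoOpRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")  # tuple, per the fix above
    SUPPORTS_CPU_MULTI_THREADING = True

    @bentoml.Runnable.method(batchable=False)
    def predict(self, payload):
        return payload  # no computation at all

noop_runner = bentoml.Runner(NoOpRunnable, name="noop_runner")
svc = bentoml.Service("noop_service", runners=[noop_runner])

@svc.api(input=JSON(), output=JSON())
async def predict(payload):
    # Each request still crosses the API-server -> runner boundary.
    return await noop_runner.predict.async_run(payload)
```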
@Benjamin Tan - are you using Yatai when running bentos in production? I currently containerize the models using bento and then deploy them on k8s without using Yatai.
b
Yeah.
Do you have the BentoDeployment manifest?
Have you verified that the GPU is being used?
g
I'm not using Yatai currently since I've seen that it needs S3 as a dependency, and I don't work with AWS
I did see that the GPU is in use, though it looks like it doesn't constantly get input to run inference on because of the CPU bottleneck
so you only see high spikes on the GPU once in a while