# ask-for-help
For example, the RTT for a single call is 117 ms with adaptive_batching disabled, whereas running the same inference locally in onnxruntime-gpu takes 55 ms.
This is my configuration.yml:
Further investigation shows that the inference time inside BentoML is about 86 ms, which is still slower than the 55 ms local run, and the other ~20 ms are spent sending the request and fetching the result. Is there any way to speed this up?
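In case it is useful for reproducing the gap, here is a minimal sketch of how one might separate pure model execution time from end-to-end HTTP RTT. The model path, endpoint URL, and request format are assumptions; adjust them to your own bento and API name.

```python
import time

import numpy as np
import onnxruntime as ort
import requests

MODEL_PATH = "model.onnx"                      # hypothetical path to the exported model
ENDPOINT = "http://localhost:3000/predict"     # hypothetical BentoML API endpoint

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Local onnxruntime-gpu baseline: pure model execution, no HTTP or serialization.
sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
sess.run(None, {input_name: x})  # warm-up
t0 = time.perf_counter()
for _ in range(20):
    sess.run(None, {input_name: x})
print("local onnxruntime:", (time.perf_counter() - t0) / 20 * 1000, "ms")

# End-to-end RTT through the HTTP API: includes JSON serialization and request parsing.
requests.post(ENDPOINT, json=x.tolist())  # warm-up
t0 = time.perf_counter()
for _ in range(20):
    requests.post(ENDPOINT, json=x.tolist())
print("HTTP RTT:", (time.perf_counter() - t0) / 20 * 1000, "ms")
```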
Hi, I wonder how large the request body is? For a large request, the HTTP request parser in Python may be the bottleneck. Could you try sending the request using gRPC? For your reference, we have a guide on gRPC serving here: https://docs.bentoml.org/en/latest/guides/grpc.html
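For reference, a minimal gRPC client sketch along the lines of that guide might look like the following. The api_name, port, and the exact bentoml.grpc.v1 message/stub names are assumptions and may differ across BentoML versions, so please check the linked guide for your version.

```python
import grpc
import numpy as np

# Stubs assumed to ship with BentoML's gRPC support (bentoml.grpc.v1 protocol).
from bentoml.grpc.v1 import service_pb2 as pb
from bentoml.grpc.v1 import service_pb2_grpc as services

x = np.random.rand(1, 3, 224, 224).astype(np.float32)

with grpc.insecure_channel("localhost:3000") as channel:
    stub = services.BentoServiceStub(channel)
    # "classify" is a placeholder for the service's actual API name.
    response = stub.Call(
        pb.Request(
            api_name="classify",
            ndarray=pb.NDArray(
                dtype=pb.NDArray.DTYPE_FLOAT,
                shape=x.shape,
                float_values=x.ravel().tolist(),
            ),
        )
    )
    print(response)
```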
Thanks! Each inference input is a (1, 3, 224, 224) tensor, i.e. a 3-channel image. I will try gRPC to fix the RTT problem. Is there any way to increase the inference speed inside the server?