# ask-for-help
s
This message was deleted.
y
I think this is a case where the request volume is too large and the runners can't keep up. For example, when a new request comes in, all runners are busy, so the request can't be processed. I would do the following checks:
• Do you have batching enabled? If yes, what max batch size and max latency have you set?
• How many runners do you initialize? You can initialize more runners in your config to increase capacity (rough sketch below).
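For reference, a minimal sketch of where these pieces live in a BentoML 1.0-style service with an XGBoost runner. The model and service names are made up; if I remember correctly, the max batch size and max latency knobs are set under the runner batching section of the BentoML configuration file, not in this code:
```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# One runner instance backed by a saved XGBoost model (tag is hypothetical).
# BentoML decides how many worker processes to start for this runner based on
# the available CPUs and the runnable's settings.
xgb_runner = bentoml.xgboost.get("my_xgb_model:latest").to_runner()

svc = bentoml.Service("my_service", runners=[xgb_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(features: np.ndarray) -> np.ndarray:
    # If the runner method is batchable, concurrent requests can be grouped
    # into a single batched inference, bounded by the configured max batch
    # size and max latency.
    return await xgb_runner.predict.async_run(features)
```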
s
Hi, thanks for your reply! Yes, we have batching enabled, with the default max latency and batch size. We also use the default settings for the runners; I believe it scales with the CPU by default. We have 2 runners.
y
In production mode, it's the number of API workers that is scaled up to the maximum the machine can support, not the number of runners. So if your CPU and memory allow, I would advise increasing the number of runners, since that will be the bottleneck.
I also wouldn't spawn too many workers, since that would simply occupy all the CPUs; the worker count should be set according to the RPS of your application.
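One (hypothetical) way to get more runner processes for the same model is to create several runner instances and spread requests across them. This is only a sketch: it assumes your BentoML version's `to_runner()` accepts a `name` argument, and all names here are made up:
```python
import itertools

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model = bentoml.xgboost.get("my_xgb_model:latest")

# Assumption: to_runner(name=...) is available, so the runner instances get
# distinct names and BentoML starts a separate worker pool for each.
runners = [model.to_runner(name=f"xgb_runner_{i}") for i in range(3)]

svc = bentoml.Service("my_service", runners=runners)
_round_robin = itertools.cycle(runners)

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(features: np.ndarray) -> np.ndarray:
    # Naive round-robin across the runner instances to balance the load.
    runner = next(_round_robin)
    return await runner.predict.async_run(features)
```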
s
Thanks @Yilun Zhang! We will give that a try!
y
PS, I have also had similar issues with large transformer models on GPUs, and it was very difficult to scale just because of the model size and GPU resources. I hope this will be easier for a CPU runtime and smaller XGBoost models 😄
s
Thanks @Yilun Zhang! We should be able to scale our XGBoost model. We also have transformer-based models on BentoML 0.13; I am building another one and plan to deploy it to 1.0.x. We don't deal with direct real-time requests. Instead, our requests are put on a queueing system to be served by the models. It is a tradeoff/compromise that we have to make with the business.
y
That makes sense. If you are processing in an offline fashion, you should be able to take advantage of batch processing and run a large batch in a single inference call.
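Something along these lines (a rough sketch with a made-up model tag; `init_local()` runs the runner in-process, which is meant for local or offline use rather than serving):
```python
import bentoml
import numpy as np

# Load the runner outside the API server for offline scoring.
runner = bentoml.xgboost.get("my_xgb_model:latest").to_runner()
runner.init_local()  # in-process runner, intended for local/offline/debug use

# Score one large feature matrix in a single predict call instead of sending
# many small requests.
batch = np.random.rand(10_000, 20)
predictions = runner.predict.run(batch)
print(predictions.shape)
```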
s
It depends on the architecture of the upstream services. In our case, batch-processed results are not easily consumed by the upstream clients. An event-driven system (plus an orchestration platform like Kubernetes) makes integration much easier in terms of reusability and decoupling.
y
I see, I won't get too deep into the actual setup etc. since it's outside the scope of your problem 😛
s
Hi @Yilun Zhang, I would like to let you know that the runners depend on the amount of available CPU, just like the api_server. When I tried to increase the number of runners, it occurred to me that I had set `SUPPORTS_CPU_MULTI_THREADING` to `False` while debugging a latency issue. I set it back to `True` this morning and redeployed the model service to production. I will monitor it throughout the week to see if that helps resolve the service-unavailable issue. Note that when `SUPPORTS_CPU_MULTI_THREADING` is `True`, I only get 1 runner; that is by design, Sean confirmed it for me previously. I will keep you posted.
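For context, that flag lives on the custom runnable class. A sketch with made-up names, assuming the model was saved with the native XGBoost Booster API (hence the `DMatrix` wrapping):
```python
import bentoml
import numpy as np
import xgboost as xgb


class XGBRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    # With True, BentoML schedules a single runner worker that may use all
    # CPUs (hence "only 1 runner"); with False it can instead start one
    # single-threaded worker per CPU.
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        # Hypothetical model tag.
        self.model = bentoml.xgboost.load_model("my_xgb_model:latest")

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, features: np.ndarray) -> np.ndarray:
        # Booster.predict expects a DMatrix, not a raw ndarray.
        return self.model.predict(xgb.DMatrix(features))


xgb_runner = bentoml.Runner(XGBRunnable, name="xgb_runner")
svc = bentoml.Service("my_service", runners=[xgb_runner])
```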
y
I see, I hope things go well! I haven't tried hosting any models on CPU, so it's new to me. But if the logic is the same as for GPU, then you should be able to specify that you want multiple CPUs and create multiple runners to balance the load.
s
Unfortunately, that didn't solve the problem. We will post a separate thread to ask BentoML folks for advice.