# ask-for-help
s
This message was deleted.
y
I think this is a case where the request volume is too large and the runners can't keep up. For example, when a new request comes in, all runners are busy, so the request can't be processed. I would do the following checks:
• Do you have batching enabled? If yes, what max batch size and max latency have you set?
• How many runners do you initialize? You can initialize more runners in your config to increase capacity (rough sketch below).
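For reference, a minimal sketch of where these pieces live in a BentoML 1.0-style service with an XGBoost runner. The model and service names are made up; if I remember correctly, the max batch size and max latency knobs are set under the runner batching section of the BentoML configuration file, not in this code:
```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# One runner instance backed by a saved XGBoost model (tag is hypothetical).
# BentoML decides how many worker processes to start for this runner based on
# the available CPUs and the runnable's settings.
xgb_runner = bentoml.xgboost.get("my_xgb_model:latest").to_runner()

svc = bentoml.Service("my_service", runners=[xgb_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(features: np.ndarray) -> np.ndarray:
    # If the runner method is batchable, concurrent requests can be grouped
    # into a single batched inference, bounded by the configured max batch
    # size and max latency.
    return await xgb_runner.predict.async_run(features)
```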
s
Hi, thanks for your reply! Yes, we have batching enabled, with the default max latency and batch size. We also use the default settings for the runners; I believe it scales with the CPU by default. We have 2 runners.
y
In production mode, it's the number of API workers that is scaled up to the maximum the machine can support, not the number of runners. So if your CPU and memory allow, I would advise increasing the number of runners, since that will be the bottleneck.
I also wouldn't spawn too many workers, since that would simply occupy all the CPUs; the worker count should be set according to the RPS of your application.
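One (hypothetical) way to get more runner processes for the same model is to create several runner instances and spread requests across them. This is only a sketch: it assumes your BentoML version's `to_runner()` accepts a `name` argument, and all names here are made up:
```python
import itertools

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model = bentoml.xgboost.get("my_xgb_model:latest")

# Assumption: to_runner(name=...) is available, so the runner instances get
# distinct names and BentoML starts a separate worker pool for each.
runners = [model.to_runner(name=f"xgb_runner_{i}") for i in range(3)]

svc = bentoml.Service("my_service", runners=runners)
_round_robin = itertools.cycle(runners)

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(features: np.ndarray) -> np.ndarray:
    # Naive round-robin across the runner instances to balance the load.
    runner = next(_round_robin)
    return await runner.predict.async_run(features)
```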
s
Thanks @Yilun Zhang! We will give that a try!
y
PS, I have also had similar issues with large transformer models on GPUs, and it was very difficult to scale just because of the model size and GPU resources. I hope this will be easier for a CPU runtime and smaller XGBoost models 😄
s
Thanks @Yilun Zhang! We should be able to scale our XGBoost model. We also have transformer-based models on BentoML 0.13; I am building another one and plan to deploy it to 1.0.x. We don't deal with direct real-time requests. Instead, our requests are put on a queueing system to be served by the models. It is a tradeoff/compromise that we have to make with the business.
y
That makes sense. If you are processing in an offline fashion, you should be able to take advantage of batch processing and run a large batch in a single inference call.
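Something along these lines (a rough sketch with a made-up model tag; `init_local()` runs the runner in-process, which is meant for local or offline use rather than serving):
```python
import bentoml
import numpy as np

# Load the runner outside the API server for offline scoring.
runner = bentoml.xgboost.get("my_xgb_model:latest").to_runner()
runner.init_local()  # in-process runner, intended for local/offline/debug use

# Score one large feature matrix in a single predict call instead of sending
# many small requests.
batch = np.random.rand(10_000, 20)
predictions = runner.predict.run(batch)
print(predictions.shape)
```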
s
It depends on the architecture of the upstream services. In our case, batch-processed results are not easily consumed by the upstream clients. An event-driven system (plus an orchestration platform like Kubernetes) makes integration much easier in terms of reusability and decoupling.
y
I see, I won't get too deep into the actual setup etc. since it's outside the scope of your problem 😛
s
Hi @Yilun Zhang, I would like to let you know that the runners depend on the amount of available CPU, just like the api_server. When I tried to increase the number of runners, it occurred to me that I had set `SUPPORTS_CPU_MULTI_THREADING` to `False` while debugging a latency issue. I set it back to `True` this morning and redeployed the model service to production. I will monitor it throughout the week to see if that helps resolve the service-unavailable issue. Note that when `SUPPORTS_CPU_MULTI_THREADING` is `True`, I only get 1 runner; that is by design, Sean confirmed it for me previously. I will keep you posted.
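For context, that flag lives on the custom runnable class. A sketch with made-up names, assuming the model was saved with the native XGBoost Booster API (hence the `DMatrix` wrapping):
```python
import bentoml
import numpy as np
import xgboost as xgb


class XGBRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    # With True, BentoML schedules a single runner worker that may use all
    # CPUs (hence "only 1 runner"); with False it can instead start one
    # single-threaded worker per CPU.
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        # Hypothetical model tag.
        self.model = bentoml.xgboost.load_model("my_xgb_model:latest")

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, features: np.ndarray) -> np.ndarray:
        # Booster.predict expects a DMatrix, not a raw ndarray.
        return self.model.predict(xgb.DMatrix(features))


xgb_runner = bentoml.Runner(XGBRunnable, name="xgb_runner")
svc = bentoml.Service("my_service", runners=[xgb_runner])
```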
y
I see, I hope things go well! I haven't tried hosting any models on CPU, so it's new to me. But if the logic is the same as for GPU, then you should be able to specify that you want multiple CPUs and create multiple runners to balance the load.
s
Unfortunately, that didn't solve the problem. We will post a separate thread to ask BentoML folks for advice.