# ask-for-help
Chaoyu
Hi Yilun, I highly recommend you check out the new project we just dropped: https://github.com/bentoml/OpenLLM
It will make serving LLMs a lot easier; you can assign GPUs to models however you need with a top-level parameter.
Aaron Pham
For the BentoML side, you can check out https://docs.bentoml.org/en/latest/guides/scheduling.html. Let's say you have 4 GPUs available. Given that you only want 1 instance of the runner, the following configuration can be used:
```yaml
runners:
  resources:
    nvidia.com/gpu: [0, 1, 2, 3]
  workers_per_resource: 0.25
```
This can be achieved with the latest version of bentoml, 1.0.22
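(For reference, a minimal sketch of the service side this configuration would apply to; the model tag and service layout are hypothetical, and it assumes the YAML above is saved as `bentoml_configuration.yaml` and passed via the `BENTOML_CONFIG` environment variable.)
```python
# service.py -- hypothetical service; launch with:
#   BENTOML_CONFIG=bentoml_configuration.yaml bentoml serve service:svc
import bentoml

# Any transformers model previously saved to the local BentoML model store;
# "falcon-7b-instruct" is just an example tag.
runner = bentoml.transformers.get("falcon-7b-instruct:latest").to_runner()
svc = bentoml.Service("llm-service", runners=[runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.JSON())
def generate(prompt: str):
    # For a saved text-generation pipeline this returns [{"generated_text": ...}]
    return runner.run(prompt)
```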
Yilun
Hey @Chaoyu, thanks for the direction! I briefly looked at it but haven't moved forward with experimentation yet, as I see only a limited number of model types are supported, and I'm not sure how it goes with model architectures not mentioned there. It seems to be advertised as supporting any model; does it really? I guess I can give it a shot today to see how things go!
Aaron Pham
what models are you looking for?
Yilun
Hey @Aaron Pham, thanks for the useful resource, I will definitely check it out! Another issue I found while using LLMs with BentoML: if a model has extra remote code, it sometimes causes issues when hosting in BentoML, where it says the model can't be loaded with any available model architecture in the transformers library. In my specific case, falcon-7b-instruct worked fine, but when I tried to host falcon-40b-instruct in 8-bit, I saw this issue. Checking both models in the BentoML model store, I didn't see the remote Python code being copied over, so it remains a mystery to me why the 7B one works but not the 40B one.
@Aaron Pham Further checking: I think if StableLM is supported, llama-family models should be supported as well, so it seems like it does have good coverage 😄. Then, since Dolly is supported, GPT-J-type models are probably naturally supported as well. I guess initially I was looking for a more complete list of models.
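(For context on the `trust_remote_code` behaviour discussed above, this is roughly what loading such a model directly through transformers looks like: the custom modeling code in the Hub repo is only executed when the flag is set, otherwise transformers fails to map the config to a built-in architecture. The model id and 8-bit flags below simply mirror the thread.)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # example repo from the thread

# Without trust_remote_code=True, the custom Falcon modeling files are not
# loaded and transformers reports that no available architecture matches.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    load_in_8bit=True,   # requires bitsandbytes + accelerate
    device_map="auto",
)
```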
Aaron Pham
Hmm, the support from bentoml with `trust_remote_code` is pretty limited atm.
One thing I will add soon is support for the llama family of models (this includes llama.cpp, Vicuna, even FastChat).
Yilun
I see. How about in OpenLLM? Is there good support for remote code?
Aaron Pham
I wrote support for chatglm, falcon, and dolly-v2, which essentially relies on `trust_remote_code`. OpenLLM does have support for `trust_remote_code`, so I want to optimistically say that it is well supported 😄
🎉 1
Yilun
That’s very nice to hear, I will give it a try!
How about 8-bit and/or 4-bit support? (Trying to check some of the important checkboxes before testing.)
Aaron Pham
Yes, we do have 8-bit support. OpenLLM models can already load in 8-bit. Note that 8-bit is not a silver bullet 🙂
Since 4-bit was just released recently, I need to dig into it a bit more. I do want to incorporate a fine-tuning API for it first.
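(For background, a sketch of what the newly released 4-bit loading looks like in plain transformers/bitsandbytes, not OpenLLM's own API; the parameter choices here are illustrative.)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) quantization config; 8-bit would just use load_in_8bit=True.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",  # example model from the thread
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```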
Yilun
Got it, thanks for sharing the progress and roadmap!
Is the suggested usage to host with the openllm command line, or to integrate with BentoML? For example, the use case could be hosting a number of models for demo purposes (called by a frontend UI) or for benchmarking.
Aaron Pham
Currently, OpenLLM provides a Runner integration with BentoML, which allows you to easily run an LLM alongside other models in BentoML. This means you will have full control in terms of resource allocation for the LLM. If you just want to serve the LLM, then I think it makes sense to just use `openllm start` and let OpenLLM manage the server for you (especially for a frontend UI, like a React or Gradio app).
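(Roughly, the Runner integration looks like this; a sketch only, where "falcon" is just an example model name and the exact runnable method exposed on the OpenLLM runner is an assumption.)
```python
import bentoml
import openllm

# openllm.Runner yields a regular BentoML runner, so GPU assignment,
# workers_per_resource, etc. are configured the same way as any other runner.
llm_runner = openllm.Runner("falcon")
svc = bentoml.Service("llm-falcon-service", runners=[llm_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    # The method name on the runner is assumed; adjust to the actual OpenLLM API.
    answer = await llm_runner.generate.async_run(input_text)
    return str(answer)
```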
Yilun
From the docs, it seems like hosting is one port per model, is that correct? If I want to host multiple models, then I will need to have them on different ports.
Aaron Pham
That is correct. We could possibly support multiple models per port, but it would involve more specific configuration settings (definitely achievable).
👍 1
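(One way this could be approximated today, as a sketch rather than an official OpenLLM feature: compose several runners into a single BentoML Service and give each model its own route. Model names and the runner method are assumptions.)
```python
import bentoml
import openllm

# Two LLM runners behind a single server/port, each under its own route.
falcon = openllm.Runner("falcon")
dolly = openllm.Runner("dolly-v2")
svc = bentoml.Service("multi-llm", runners=[falcon, dolly])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(), route="/falcon/generate")
async def falcon_generate(prompt: str) -> str:
    return str(await falcon.generate.async_run(prompt))  # method name assumed

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(), route="/dolly/generate")
async def dolly_generate(prompt: str) -> str:
    return str(await dolly.generate.async_run(prompt))  # method name assumed
```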
Yilun
Got it, thanks! I think independent hosting, especially for LLMs, can be more natural, as it's probably unlikely to host multiple of them together. Just wondering: in this case, if I'm hosting with openllm, I guess (at least for now) there's no support for multi-instance serving, dynamic batching, etc., right?
Aaron Pham
Yes, currently I override the configuration set for dynamic batching, but obviously we can always support it later.
👍 1
Since the models are relatively big, dynamic batching might not make sense for now; you would end up waiting for the model to run inference nonetheless.
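(For background, this is how dynamic/adaptive batching is normally declared for a BentoML runner: a runnable method marked `batchable=True` lets the server group concurrent requests into one batch. A minimal sketch with a placeholder model follows; per the note above, OpenLLM currently overrides this setting for its runners.)
```python
from typing import List

import bentoml

class MyLLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def generate(self, prompts: List[str]) -> List[str]:
        # Placeholder; a real implementation would run batched inference here.
        return [p.upper() for p in prompts]

runner = bentoml.Runner(MyLLMRunnable, name="my_llm")
```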
Yilun
Makes sense, thank you for the explanation!