# ask-for-help
Chaoyu
Hi Yilun, I highly recommend you check out the new project we just dropped: https://github.com/bentoml/OpenLLM
It will make serving LLMs a lot easier; you can assign GPUs to models however you need with a top-level parameter.
Aaron Pham
For the BentoML side, you can check out https://docs.bentoml.org/en/latest/guides/scheduling.html. Let's say you have 4 GPUs available. Given that you only want 1 instance of the runner, the following configuration can be used:
```yaml
runners:
  resources:
    nvidia.com/gpu: [0, 1, 2, 3]
  workers_per_resource: 0.25
```
This can be achieved with the latest version of bentoml, 1.0.22
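(For reference, a minimal sketch of the service side this configuration would apply to; the model tag and service layout are hypothetical, and it assumes the YAML above is saved as `bentoml_configuration.yaml` and passed via the `BENTOML_CONFIG` environment variable.)
```python
# service.py -- hypothetical service; launch with:
#   BENTOML_CONFIG=bentoml_configuration.yaml bentoml serve service:svc
import bentoml

# Any transformers model previously saved to the local BentoML model store;
# "falcon-7b-instruct" is just an example tag.
runner = bentoml.transformers.get("falcon-7b-instruct:latest").to_runner()
svc = bentoml.Service("llm-service", runners=[runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.JSON())
def generate(prompt: str):
    # For a saved text-generation pipeline this returns [{"generated_text": ...}]
    return runner.run(prompt)
```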
Yilun
Hey @Chaoyu, thanks for the direction! I briefly looked at it but haven't moved forward with experimentation yet, as I see only a limited number of model types are supported, and I'm not sure how it goes with model architectures not mentioned there. It seems to be advertised as supporting any model; does it really? I guess I can give it a shot today to see how things go!
Aaron Pham
what models are you looking for?
Yilun
Hey @Aaron Pham, thanks for the useful resource, I will definitely check it out! Another issue I found while using LLMs with BentoML: if a model has extra remote code, it sometimes causes issues when hosting in BentoML, where it says the model can't be loaded with any available model architecture in the transformers library. In my specific case, falcon-7b-instruct worked fine, but when I tried to host falcon-40b-instruct in 8-bit, I saw this issue. Checking both models in the BentoML model store, I didn't see the remote Python code being copied over, so it remains a mystery to me why the 7B one works but not the 40B one.
@Aaron Pham Further checking: I think if StableLM is supported, llama-family models should be supported as well, so it seems like it does have good coverage 😄. Then, since Dolly is supported, GPT-J-type models are probably naturally supported as well. I guess initially I was looking for a more complete list of models.
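(For context on the `trust_remote_code` behaviour discussed above, this is roughly what loading such a model directly through transformers looks like: the custom modeling code in the Hub repo is only executed when the flag is set, otherwise transformers fails to map the config to a built-in architecture. The model id and 8-bit flags below simply mirror the thread.)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # example repo from the thread

# Without trust_remote_code=True, the custom Falcon modeling files are not
# loaded and transformers reports that no available architecture matches.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    load_in_8bit=True,   # requires bitsandbytes + accelerate
    device_map="auto",
)
```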
Aaron Pham
Hmm, the support from bentoml with `trust_remote_code` is pretty limited atm.
One thing I will add soon is support for the llama family of models (this includes llama.cpp, Vicuna, even FastChat).
Yilun
I see. How about in OpenLLM? Is there good support for remote code?
Aaron Pham
I wrote support for chatglm, falcon, and dolly-v2, which essentially relies on `trust_remote_code`. OpenLLM does have support for `trust_remote_code`, so I want to optimistically say that it is well supported 😄
🎉 1
Yilun
That’s very nice to hear, I will give it a try!
How about 8-bit and/or 4-bit support? (Trying to check some of the important checkboxes before testing.)
Aaron Pham
Yes, we do have 8-bit support. OpenLLM models can already load in 8-bit. Note that 8-bit is not a silver bullet 🙂
Since 4-bit was just released recently, I need to dig into it a bit more. I do want to incorporate a fine-tuning API for it first.
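(For background, a sketch of what the newly released 4-bit loading looks like in plain transformers/bitsandbytes, not OpenLLM's own API; the parameter choices here are illustrative.)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) quantization config; 8-bit would just use load_in_8bit=True.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",  # example model from the thread
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```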
Yilun
Got it, thanks for sharing the progress and roadmap!
Is the suggested usage to host with the openllm command line, or to integrate with BentoML? For example, the use case could be hosting a number of models for demo purposes (called by a frontend UI) or for benchmarking.
Aaron Pham
Currently, OpenLLM provides a Runner integration with BentoML, which allows you to easily run an LLM alongside other models in BentoML. This means you will have full control in terms of resource allocation for the LLM. If you just want to serve the LLM, then I think it makes sense to just use `openllm start` and let OpenLLM manage the server for you (especially for a frontend UI, like a React or Gradio app).
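(Roughly, the Runner integration looks like this; a sketch only, where "falcon" is just an example model name and the exact runnable method exposed on the OpenLLM runner is an assumption.)
```python
import bentoml
import openllm

# openllm.Runner yields a regular BentoML runner, so GPU assignment,
# workers_per_resource, etc. are configured the same way as any other runner.
llm_runner = openllm.Runner("falcon")
svc = bentoml.Service("llm-falcon-service", runners=[llm_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    # The method name on the runner is assumed; adjust to the actual OpenLLM API.
    answer = await llm_runner.generate.async_run(input_text)
    return str(answer)
```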
Yilun
From the docs, it seems like hosting is one port per model, is that correct? If I want to host multiple models, then I will need to have them on different ports.
Aaron Pham
That is correct. We could possibly support multiple models per port, but it would involve more specific configuration settings (definitely achievable).
👍 1
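(One way this could be approximated today, as a sketch rather than an official OpenLLM feature: compose several runners into a single BentoML Service and give each model its own route. Model names and the runner method are assumptions.)
```python
import bentoml
import openllm

# Two LLM runners behind a single server/port, each under its own route.
falcon = openllm.Runner("falcon")
dolly = openllm.Runner("dolly-v2")
svc = bentoml.Service("multi-llm", runners=[falcon, dolly])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(), route="/falcon/generate")
async def falcon_generate(prompt: str) -> str:
    return str(await falcon.generate.async_run(prompt))  # method name assumed

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(), route="/dolly/generate")
async def dolly_generate(prompt: str) -> str:
    return str(await dolly.generate.async_run(prompt))  # method name assumed
```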
Yilun
Got it, thanks! I think independent hosting, especially for LLMs, can be more natural, as it's probably unlikely to host multiple of them together. Just wondering: in this case, if I'm hosting with openllm, I guess (at least for now) there's no support for multi-instance serving, dynamic batching, etc., right?
Aaron Pham
Yes, currently I override the configuration set for dynamic batching, but obviously we can always support it later.
👍 1
Since the models are relatively big, dynamic batching might not make sense for now; you would end up waiting for the model to run inference nonetheless.
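(For background, this is how dynamic/adaptive batching is normally declared for a BentoML runner: a runnable method marked `batchable=True` lets the server group concurrent requests into one batch. A minimal sketch with a placeholder model follows; per the note above, OpenLLM currently overrides this setting for its runners.)
```python
from typing import List

import bentoml

class MyLLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def generate(self, prompts: List[str]) -> List[str]:
        # Placeholder; a real implementation would run batched inference here.
        return [p.upper() for p in prompts]

runner = bentoml.Runner(MyLLMRunnable, name="my_llm")
```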
Yilun
Makes sense, thank you for the explanation!