yes, OpenChatKit (GPT-NeoX 20B)
the model isn't that important though, i think the same questions mostly apply to other models
(we also want to run Whisper, which has the same constraints, plus an encoder that needs to run once before the decoder loop; quick sketch below)
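roughly what i mean for the Whisper case, as a loose sketch (`encode` and `decode_step` are made-up names for the two halves of the model, not a real API):

```python
# loose sketch: encoder runs once per request, then the same autoregressive
# decoder loop as a text model. `encode`/`decode_step` are invented stand-ins.
def transcribe(audio_features, encode, decode_step, bos_token, eos_token, max_tokens=448):
    encoder_state = encode(audio_features)  # one encoder pass, reused every step
    tokens = [bos_token]
    for _ in range(max_tokens):
        # each decoder step attends to the fixed encoder_state plus the tokens so far
        next_token = decode_step(tokens, encoder_state)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens
```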
i realize now that having the decoder loop in the HTTP endpoint, and not in the runner, is probably the simplest way to do it (see the sketch after this list):
• i call my runner from the endpoint for every token i want to generate
• i don't have to deal with the problem of some requests in a batch finishing earlier than others
◦ or the user interrupting the request midway through
• this implies a lot of communication with the runners? maybe that's bad? i don't really know, but i think this makes sense
=> in this case, i don't need the runner to stream tokens back to the endpoint, i only need the endpoint to stream tokens to the end user through SSE
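so the endpoint side could look something like this; a rough sketch only, the runner URL, its request/response shape, and `EOS_TOKEN` are all invented here:

```python
# sketch only: RUNNER_URL, the payload shape, and EOS_TOKEN are assumptions
import json

import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
RUNNER_URL = "http://runner:8000/step"  # hypothetical runner endpoint
EOS_TOKEN = 0  # placeholder, model-specific

async def call_runner(client: httpx.AsyncClient, tokens: list[int]) -> int:
    # one forward pass over the full sequence; the runner returns the next token id
    resp = await client.post(RUNNER_URL, json={"tokens": tokens})
    resp.raise_for_status()
    return resp.json()["next_token"]

@app.post("/generate")
async def generate(body: dict):
    tokens: list[int] = list(body["prompt_tokens"])
    max_new = body.get("max_new_tokens", 128)

    async def stream():
        async with httpx.AsyncClient() as client:
            for _ in range(max_new):
                next_token = await call_runner(client, tokens)
                if next_token == EOS_TOKEN:  # this request stops on its own schedule
                    break
                tokens.append(next_token)
                yield f"data: {json.dumps({'token': next_token})}\n\n"  # one SSE event per token
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")
```

nice side effect: when the client disconnects, starlette cancels the streaming generator, so we just stop calling the runner; that should cover the interruption case basically for free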
i can probably make that work with the kv-cache optimisation by just using two models, meaning two different runners: one that takes the kv-cache as input, and one that doesn't
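sketch of what the two-runner split could look like from the endpoint's side (everything here is invented: `StepResult`, the runner callables, the idea of passing the cache around as an opaque handle):

```python
# sketch only: the prefill/decode runner interfaces below are assumptions
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    next_token: int
    kv_cache: object  # opaque handle; in practice probably a cache id, not the raw tensors

def generate(
    prompt_tokens: list[int],
    max_new_tokens: int,
    prefill_runner: Callable[[list[int]], StepResult],  # no cache in, cache out
    decode_runner: Callable[[int, object], StepResult],  # cache in, updated cache out
) -> list[int]:
    out = []
    # first step: no cache exists yet, so it can only batch with other prefill steps
    step = prefill_runner(prompt_tokens)
    out.append(step.next_token)
    # later steps all have a cache, so they can batch with other decode steps
    for _ in range(max_new_tokens - 1):
        step = decode_runner(step.next_token, step.kv_cache)
        out.append(step.next_token)
    return out
```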
for context: the problem with the kv-cache in this kind of autoregressive model is that you can't really batch a request that already has a cached state together with one that doesn't: the cached request only feeds its newest token through the model, while the uncached one has to run its whole prompt, so the input shapes don't line up. and you only get a cache state once you've generated at least one token for the request
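toy illustration of the mismatch (ignoring padding tricks), assuming the usual (batch, seq_len) token layout:

```python
# why a prefill request and a decode request don't fit in one batch:
# their seq_len dimensions differ, so they can't share a forward pass
import torch

prefill_input = torch.randint(0, 100, (1, 17))  # uncached: whole 17-token prompt at once
decode_input = torch.randint(0, 100, (1, 1))    # cached: just the newest token

try:
    torch.cat([prefill_input, decode_input], dim=0)  # batch-dim concat needs equal seq_len
except RuntimeError as e:
    print("can't batch:", e)
```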