# ask-for-help
We’ve been talking about this internally since it makes a lot of sense for LLMs and many diffusion models. Would love to hear more about your use case.
Are you looking for something like SSE or gRPC streaming on the client side?
Yes! Absolutely
My use case is a ChatGPT-like chatbot where the tokens get streamed back to the client in the browser one by one, probably through SSE or WebSocket. I need batching, so this probably goes deeper than just adding SSE to Bento.
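For reference, a minimal sketch of the SSE side, assuming the endpoint can be exposed as a plain ASGI app (shown here with FastAPI/Starlette); `generate_tokens` is a hypothetical stand-in for whatever produces tokens one by one:

```python
# Minimal SSE sketch: stream tokens to the browser one by one.
# `generate_tokens` is a placeholder, not a real model call.
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_tokens(prompt: str):
    # Placeholder: yield tokens as the model/runner produces them.
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)  # simulate per-token latency
        yield token


@app.get("/chat")
async def chat(prompt: str):
    async def event_stream():
        async for token in generate_tokens(prompt):
            yield f"data: {token}\n\n"  # SSE frame format
        yield "data: [DONE]\n\n"        # conventional end-of-stream marker

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

On the browser side, an `EventSource("/chat?prompt=...")` would then receive one message event per token.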
@Charles CHUDANT Got it, do you plan to use an open-source LLM? What’s the reason you need batching?
Yes, OpenChatKit (GPT-NeoX 20B). The model isn't that important though; I think it's mostly the same questions for other models. (We also want to run Whisper, which has the same constraints, plus the fact that it has an encoder model that needs to be run once before the decoder loop.)

I realize now that having the decoder loop in the HTTP endpoint, and not in the runner, is probably the simplest way to do it:
• I call my runner from the endpoint for every token I want to generate.
• I don't have to deal with the problem where some requests finish earlier than other requests in the same batch,
  ◦ or the user interrupting the request midway through.
• This implies a lot of communication with the runners? Maybe that's bad? I don't really know; I think this makes sense though.

=> In this case, I don't need the runner to be able to stream tokens back to the endpoint; I only need the endpoint to stream tokens to the end user through SSE (see the first sketch below).

I can probably make that work with the kv-cache optimisation by just using two models, meaning two different runners: one that takes the kv-cache as input, and one that does not (see the second sketch below). For context: the problem with the kv-cache in this kind of autoregressive model is that you cannot really batch a request that already has a cached state together with one that does not, and you only get the cached state once you've generated at least one token for the request.
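A sketch of the "decoder loop in the endpoint" idea: the handler owns the generation loop, calls the runner once per token, and can stop early on EOS or when the client goes away. The runner interface (`async_run`), the fake runner, and the EOS id are hypothetical stand-ins, not a specific BentoML API:

```python
import asyncio
from typing import AsyncIterator, List

EOS_TOKEN_ID = 0  # hypothetical end-of-sequence id


class FakeDecodeRunner:
    """Stand-in for a runner that returns the next token id for a sequence."""

    async def async_run(self, token_ids: List[int]) -> int:
        await asyncio.sleep(0.01)  # pretend to do a forward pass
        # stop once the sequence grows past 20 ids in this toy example
        return EOS_TOKEN_ID if len(token_ids) > 20 else len(token_ids) + 1


decode_runner = FakeDecodeRunner()


async def stream_completion(prompt_ids: List[int],
                            max_new_tokens: int = 256) -> AsyncIterator[int]:
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # One runner call per generated token: the runner is free to batch
        # this step with the same step of other in-flight requests.
        next_id = await decode_runner.async_run(token_ids)
        if next_id == EOS_TOKEN_ID:
            break
        token_ids.append(next_id)
        yield next_id  # the endpoint forwards this to the client (e.g. SSE)
```

The endpoint would drive this generator and turn each yielded id into an SSE frame, so a finished or cancelled request simply stops calling the runner without disturbing the batch.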
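And a sketch of the two-runner kv-cache split: a "prefill" runner for requests with no cached state (full prompt pass, returning the first token plus the cache) and a "decode" runner for requests that already have one, so each runner only ever batches requests of the same kind. The runner objects and their `async_run` signatures are hypothetical, for illustration only:

```python
from typing import Any, AsyncIterator, List

EOS_TOKEN_ID = 0  # hypothetical end-of-sequence id


async def generate(prompt_ids: List[int],
                   prefill_runner: Any,
                   decode_runner: Any,
                   max_new_tokens: int = 256) -> AsyncIterator[int]:
    # 1) Prefill: no kv-cache yet, so this call can only be batched with
    #    other cache-less requests.
    token_id, kv_cache = await prefill_runner.async_run(prompt_ids)
    yield token_id

    for _ in range(max_new_tokens - 1):
        # 2) Decode: every later step has a cache, so it batches with the
        #    decode steps of other in-flight requests.
        token_id, kv_cache = await decode_runner.async_run(token_id, kv_cache)
        if token_id == EOS_TOKEN_ID:
            break
        yield token_id
```

The split mirrors the constraint above: a request only moves from the prefill batch to the decode batch after its first generated token.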
I need batching because throughput is very bad without it.