yes, OpenChatKit (GPT-NeoX 20B)
the model isn't that important though, i think the same questions mostly apply to other models
(we also want to run Whisper, which has the same constraints, plus an encoder that needs to run once before the decoder loop; quick sketch below)
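roughly what i mean for the Whisper case, as a loose sketch (`encode` and `decode_step` are made-up names for the two halves of the model, not a real API):

```python
# loose sketch: encoder runs once per request, then the same autoregressive
# decoder loop as a text model. `encode`/`decode_step` are invented stand-ins.
def transcribe(audio_features, encode, decode_step, bos_token, eos_token, max_tokens=448):
    encoder_state = encode(audio_features)  # one encoder pass, reused every step
    tokens = [bos_token]
    for _ in range(max_tokens):
        # each decoder step attends to the fixed encoder_state plus the tokens so far
        next_token = decode_step(tokens, encoder_state)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens
```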
i realize now that having the decoder loop in the HTTP endpoint, and not in the runner, is probably the simplest way to do it (see the sketch after this list):
• i call my runner from the endpoint for every token i want to generate
• i don't have to deal with the problem of some requests in a batch finishing earlier than others
◦ or the user interrupting the request midway through
• this implies a lot of communication with the runners? maybe that's bad? i don't really know, but i think this makes sense
=> in this case, i don't need the runner to stream tokens back to the endpoint, i only need the endpoint to stream tokens to the end user through SSE
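so the endpoint side could look something like this; a rough sketch only, the runner URL, its request/response shape, and `EOS_TOKEN` are all invented here:

```python
# sketch only: RUNNER_URL, the payload shape, and EOS_TOKEN are assumptions
import json

import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
RUNNER_URL = "http://runner:8000/step"  # hypothetical runner endpoint
EOS_TOKEN = 0  # placeholder, model-specific

async def call_runner(client: httpx.AsyncClient, tokens: list[int]) -> int:
    # one forward pass over the full sequence; the runner returns the next token id
    resp = await client.post(RUNNER_URL, json={"tokens": tokens})
    resp.raise_for_status()
    return resp.json()["next_token"]

@app.post("/generate")
async def generate(body: dict):
    tokens: list[int] = list(body["prompt_tokens"])
    max_new = body.get("max_new_tokens", 128)

    async def stream():
        async with httpx.AsyncClient() as client:
            for _ in range(max_new):
                next_token = await call_runner(client, tokens)
                if next_token == EOS_TOKEN:  # this request stops on its own schedule
                    break
                tokens.append(next_token)
                yield f"data: {json.dumps({'token': next_token})}\n\n"  # one SSE event per token
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")
```

nice side effect: when the client disconnects, starlette cancels the streaming generator, so we just stop calling the runner; that should cover the interruption case basically for free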
i can probably make that work with the kv-cache optimisation by just using two models, meaning two different runners: one that takes the kv-cache as input, and one that doesn't
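sketch of what the two-runner split could look like from the endpoint's side (everything here is invented: `StepResult`, the runner callables, the idea of passing the cache around as an opaque handle):

```python
# sketch only: the prefill/decode runner interfaces below are assumptions
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    next_token: int
    kv_cache: object  # opaque handle; in practice probably a cache id, not the raw tensors

def generate(
    prompt_tokens: list[int],
    max_new_tokens: int,
    prefill_runner: Callable[[list[int]], StepResult],  # no cache in, cache out
    decode_runner: Callable[[int, object], StepResult],  # cache in, updated cache out
) -> list[int]:
    out = []
    # first step: no cache exists yet, so it can only batch with other prefill steps
    step = prefill_runner(prompt_tokens)
    out.append(step.next_token)
    # later steps all have a cache, so they can batch with other decode steps
    for _ in range(max_new_tokens - 1):
        step = decode_runner(step.next_token, step.kv_cache)
        out.append(step.next_token)
    return out
```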
for context: the problem with the kv-cache in this kind of autoregressive model is that you can't really batch a request that already has a cached state together with one that doesn't: the cached request only feeds its newest token through the model, while the uncached one has to run its whole prompt, so the input shapes don't line up. and you only get a cache state once you've generated at least one token for the request
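toy illustration of the mismatch (ignoring padding tricks), assuming the usual (batch, seq_len) token layout:

```python
# why a prefill request and a decode request don't fit in one batch:
# their seq_len dimensions differ, so they can't share a forward pass
import torch

prefill_input = torch.randint(0, 100, (1, 17))  # uncached: whole 17-token prompt at once
decode_input = torch.randint(0, 100, (1, 1))    # cached: just the newest token

try:
    torch.cat([prefill_input, decode_input], dim=0)  # batch-dim concat needs equal seq_len
except RuntimeError as e:
    print("can't batch:", e)
```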