# general
n
Hey all! Looking for people who use LLMs in production - I want to learn about how you do monitoring! (as we've been frustrated with all existing tools 😓)
👀 5
g
Great Q. We're just starting, so looking to hear from others. So far, we've been using OpenAI and monitoring it like a normal API (usage, latency, error rates). But I suspect there's a lot we should be doing and don't know about yet.
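For reference, a minimal sketch of the "monitor it like a normal API" approach: wrap each call to record latency, token usage, and errors. The model name and the `record_metric` helper below are placeholders for whatever metrics backend you already use.

```python
import time
from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()

def record_metric(name, value):
    # Placeholder: swap in your real metrics client (StatsD, Prometheus, CloudWatch, ...).
    print(f"{name}={value}")

def monitored_chat(messages, model="gpt-4o-mini"):
    """Call the chat endpoint and record latency, token usage, and errors."""
    start = time.monotonic()
    try:
        resp = client.chat.completions.create(model=model, messages=messages)
        record_metric("llm.latency_seconds", time.monotonic() - start)
        record_metric("llm.prompt_tokens", resp.usage.prompt_tokens)
        record_metric("llm.completion_tokens", resp.usage.completion_tokens)
        return resp.choices[0].message.content
    except Exception:
        record_metric("llm.errors", 1)
        raise
```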
r
@Nir Gazit could you share what you've tried and what hasn't worked so far?
n
We're missing a way to measure the quality of the results and to check for regressions. We ended up building our own internal custom tool. @Gwen Shapira @Ram would love to chat if you're using LLMs in prod 🙂
e
Seems like fertile ground for a new product, especially since there's nothing out there yet!
n
There are plenty of tools trying to do o11y for LLMs, but I think there’s an interesting unexplored edge there
g
@Nir Gazit do share? I was under the impression that only a human can evaluate the quality of an LLM response...
n
@Gwen Shapira yes, but evaluating a regression in the quality can potentially be done by machines in a quantifiable manner
❤️ 1
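One way to make "quantifiable regression checks" concrete, as a sketch: score fresh answers against a golden set of approved answers (here via embedding similarity, though an LLM judge or a task-specific metric works too) and flag a regression when the aggregate score drops below a recorded baseline. The golden set, model names, and thresholds below are illustrative assumptions, not a prescribed setup.

```python
import math
from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()

# Hypothetical golden set: prompts paired with previously approved answers.
GOLDEN_SET = [
    {"prompt": "Summarize our refund policy in one sentence.",
     "reference": "Customers can request a full refund within 30 days of purchase."},
]

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return resp.data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def regression_score(generate):
    """Average similarity between freshly generated answers and approved references."""
    scores = []
    for case in GOLDEN_SET:
        answer = generate(case["prompt"])
        scores.append(cosine(embed(answer), embed(case["reference"])))
    return sum(scores) / len(scores)

# Fail the check if quality drops noticeably below the last recorded baseline.
BASELINE, TOLERANCE = 0.90, 0.05

def check_regression(generate):
    return regression_score(generate) >= BASELINE - TOLERANCE
```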
g
I'm all 👂
a
even if it does require human auditing, there's probably still useful automation to build: a test harness environment + AWS Mechanical Turk, with a message bus or dashboard for the resulting verdicts
I would also be curious to hear how others are doing this
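A rough sketch of that harness-plus-Turk idea: run a fixed prompt set through the model and push each (prompt, response) pair onto a queue that a human-review workflow (Mechanical Turk or an internal one) drains, with verdicts landing on a dashboard. The queue URL and the `generate` callable are assumptions.

```python
import json
import boto3  # assumes AWS credentials are already configured

sqs = boto3.client("sqs")
REVIEW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-review"  # placeholder

def enqueue_for_review(prompt, response, metadata=None):
    """Push a (prompt, response) pair onto a review queue.

    A separate worker could turn each message into a Mechanical Turk HIT
    (or an internal review task) and write the verdicts to a dashboard.
    """
    sqs.send_message(
        QueueUrl=REVIEW_QUEUE_URL,
        MessageBody=json.dumps({
            "prompt": prompt,
            "response": response,
            "metadata": metadata or {},
        }),
    )

def run_harness(test_prompts, generate):
    """Run every test prompt through the model and queue the outputs for audit."""
    for prompt in test_prompts:
        enqueue_for_review(prompt, generate(prompt))
```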
g
woah, I didn't think about turk-driven-development 🙂
r
Is one strategy for ‘mechanically’ evaluating response quality to evaluate question rephrasing?
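If "evaluate question rephrasing" means checking that paraphrased versions of a question get consistent answers, a minimal sketch might look like the following; the `generate` callable and the string-similarity proxy are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(question, paraphrases, generate):
    """Answers to paraphrased questions should agree; low pairwise
    similarity flags fragile behavior.

    SequenceMatcher is only a crude stand-in; embedding similarity or an
    LLM judge would measure semantic agreement better.
    """
    answers = [generate(q) for q in [question, *paraphrases]]
    pairs = list(combinations(answers, 2))
    if not pairs:  # need at least one paraphrase to compare
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```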
s
so far it's been done manually and is very tricky. I started a project to automate some parts of testing, since evaluating manually was a big headache for me personally - https://github.com/sundi133/llm-datacraft
would love any thoughts, and to collaborate if you're interested - I have some ideas for automating QA generation more intelligently and for ranking an LLM endpoint