Which part in particular is poor? The latency? The accuracy? The goal with KITT is to get as close to real, human-like interaction as possible, so one trade-off we made is optimizing for latency over accuracy. We send smaller audio context windows to Google (I can’t remember the exact size, but less than the recommended 100 ms). Depending on your needs, you may want to tweak the context window being sent to STT; a larger window should trade some latency for better accuracy. We may also have some other configuration parameters set on the Google Cloud side that affect latency or accuracy in some way. @boundless-energy-78552 knows more than I do here.
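
If it helps when experimenting with the window size, here is a rough sketch of how a window duration maps to a chunk size in bytes. It is not KITT's actual code: the function names are made up for illustration, and it assumes 16 kHz, 16-bit, mono PCM, which may not match what you send.

```python
from typing import Iterator


def window_size_bytes(
    window_ms: int,
    sample_rate_hz: int = 16_000,   # assumed sample rate
    sample_width_bytes: int = 2,    # assumed 16-bit samples
    channels: int = 1,              # assumed mono
) -> int:
    """Number of bytes of raw PCM covering `window_ms` milliseconds."""
    samples = (sample_rate_hz * window_ms) // 1000
    return samples * sample_width_bytes * channels


def split_into_windows(pcm: bytes, window_ms: int) -> Iterator[bytes]:
    """Yield consecutive chunks of `pcm`, each at most `window_ms` long.

    The last chunk may be shorter if the audio does not divide evenly.
    """
    size = window_size_bytes(window_ms)
    if size <= 0:
        raise ValueError("window_ms is too small for the assumed audio format")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence in the assumed format.
one_second = bytes(window_size_bytes(1000))

# Smaller windows mean more, smaller requests: lower latency per chunk,
# but less audio context for the recognizer in each one.
chunks_100ms = list(split_into_windows(one_second, 100))
chunks_20ms = list(split_into_windows(one_second, 20))
```

Each chunk would then go out as one streaming request, so the window size is the knob to turn when comparing latency against accuracy.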