I have a requirement where the user has to speak according to a predefined script. I want to implement a turn detection algorithm that detects both whether the user has completed a sentence (already provided by LiveKit) and whether the user has spoken the predefined script completely. If the user goes completely off script, the agent can respond right away. If the user is following the script (preferably judged by a semantic match), I want to delay the agent's response and give the user more time to complete the script.
How can I achieve this? Any reference on building a custom turn detection implementation would also help.
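
For concreteness, the decision logic described above might look roughly like the sketch below. It is only an illustration under several assumptions: the function names (`script_progress`, `endpoint_delay`), the thresholds, and the delay values are all hypothetical, and plain word-sequence matching from Python's standard `difflib` stands in for the semantic match (an embedding-based comparison could replace it). How the resulting delay would be fed into LiveKit's turn detection is the open part and is not shown.

```python
import re
from difflib import SequenceMatcher


def _words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def script_progress(script, transcript, threshold=0.6):
    """Return (on_script, covered).

    on_script: True if at least `threshold` of the spoken words line up,
               in order, with words of the script.
    covered:   fraction of the script reached by the furthest matched word.
    """
    s, t = _words(script), _words(transcript)
    if not s or not t:
        return False, 0.0
    blocks = SequenceMatcher(None, s, t, autojunk=False).get_matching_blocks()
    matched = sum(b.size for b in blocks)
    on_script = matched / len(t) >= threshold
    # Furthest position in the script that any matched block reaches.
    last = max((b.a + b.size for b in blocks if b.size), default=0)
    return on_script, last / len(s)


def endpoint_delay(script, transcript,
                   base_delay=0.5, extended_delay=3.0, done_at=0.9):
    """Pick how long (in seconds) to wait before the agent responds.

    Off script, or script essentially finished: respond after base_delay.
    On script but unfinished: wait extended_delay to let the user continue.
    """
    on_script, covered = script_progress(script, transcript)
    if not on_script or covered >= done_at:
        return base_delay
    return extended_delay
```

With the script "the quick brown fox jumps over the lazy dog", a partial transcript such as "the quick brown fox" yields the extended delay, while an unrelated utterance or the full script yields the base delay.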