forked from livekit/agents

# feat: Intelligent Interruption Handling for Voice Agents #486

**Status:** Open. anishh15 wants to merge 11 commits into `Dark-Sys-Jenkins:main` from `anishh15:feature/interrupt-handler-anish-laddha`.
## Commits (11)

- `ce85e82` feat: add InterruptionFilter module for intelligent backchannel detec…
- `0d0e3c4` feat: add interruption filter configuration to AgentSession
- `b980fe6` feat: integrate interruption filter into agent activity loop
- `b89f51b` feat: skip EOU detection for filtered backchannel utterances
- `f7a375b` feat: export InterruptionFilter in public API
- `cb01b25` fix: add missing trace span initialization
- `95aaac1` test: add unit tests for InterruptionFilter
- `2270322` docs: add demo agent for testing interruption handling
- `92a8f4e` docs: add proof transcript demonstrating feature functionality
- `0570b51` docs: add CHANGES.md with implementation details
- `82d2fbb` fix: address Copilot review feedback
# LiveKit Intelligent Interruption Handler

This implementation adds intelligent backchannel filtering to LiveKit voice agents. The agent can now distinguish between passive acknowledgements ("yeah", "mhmm", "okay") and actual interruption commands ("stop", "wait", "no").

## Problem Solved

When users provide backchannel feedback while an agent is speaking, the default VAD would interrupt the agent. This created a choppy conversation experience. Now:

- **Agent is speaking + user says "mhmm"** → Agent continues seamlessly
- **Agent is speaking + user says "stop"** → Agent stops immediately
- **Agent is silent + user says "yeah"** → Agent responds normally

## Quick Start
```bash
# Install dependencies
cd livekit-agents
pip install -e .

# Set up environment variables
export LIVEKIT_URL="wss://your-livekit-server"
export LIVEKIT_API_KEY="your-api-key"
export LIVEKIT_API_SECRET="your-api-secret"
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"

# Run the demo agent
cd examples/voice_agents
python interrupt_demo.py dev
```

Then connect via [LiveKit Playground](https://agents-playground.livekit.io/).
## How It Works

### Architecture

```
User Speech → VAD → STT → InterruptionFilter → Agent Response
                                 ↓
                       Checks agent state:
                       - Speaking? Filter backchannels
                       - Silent? Allow all input
```
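The flow above hides one timing subtlety: the VAD fires when the user starts talking, but the transcript only arrives once STT finalizes. A minimal sketch of that event sequence follows; the class and method names here are illustrative, not the actual integration code in `agent_activity.py`:

```python
class PipelineSketch:
    """Illustrative event flow: VAD fires first, STT finalizes later."""

    def __init__(self, should_interrupt):
        self.should_interrupt = should_interrupt  # filter callback (transcript, state) -> bool
        self.agent_state = "speaking"             # "speaking" or "listening"
        self._state_at_speech_start = None

    def on_vad_speech_start(self):
        # Snapshot the agent state the moment the user starts talking,
        # so a later state change cannot flip the filter's verdict.
        self._state_at_speech_start = self.agent_state

    def on_stt_final(self, transcript):
        # Use the snapshot, falling back to the live state if VAD never fired.
        state = self._state_at_speech_start or self.agent_state
        return self.should_interrupt(transcript, state)
```

This is the race the "State Capture Timing" design decision below addresses: the verdict is based on what the agent was doing when speech began, not when the transcript lands.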
### Key Components

1. **InterruptionFilter** (`livekit/agents/voice/interruption_filter.py`)
   - Core filtering logic with configurable word lists
   - `should_interrupt(transcript, agent_state)` returns True/False

2. **Word Lists** (configurable)
   - `DEFAULT_BACKCHANNEL_WORDS`: yeah, ok, mhmm, uh-huh, right, sure, etc.
   - `DEFAULT_INTERRUPT_WORDS`: stop, wait, no, actually, hold on, etc.

3. **Integration Points**
   - `agent_activity.py`: Captures agent state, applies filter
   - `audio_recognition.py`: Skips EOU detection for filtered utterances
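A minimal sketch of the filtering rules these components describe. The word sets and the `should_interrupt(transcript, agent_state)` signature come from this README; the exact implementation in `interruption_filter.py` may normalize text differently:

```python
# Abbreviated versions of the configurable word lists described above.
DEFAULT_BACKCHANNEL_WORDS = {"yeah", "yes", "ok", "okay", "mhmm", "uh-huh", "right", "sure"}
DEFAULT_INTERRUPT_WORDS = {"stop", "wait", "no", "actually", "hold on"}


def should_interrupt(transcript: str, agent_state: str) -> bool:
    """Decide whether a user utterance should interrupt the agent."""
    if agent_state != "speaking":
        return True  # agent is silent: every utterance is processed normally
    text = " ".join(transcript.lower().split())
    tokens = [t.strip(".,!?") for t in text.split()]
    single_words = {w for w in DEFAULT_INTERRUPT_WORDS if " " not in w}
    phrases = DEFAULT_INTERRUPT_WORDS - single_words
    # An explicit interrupt word wins, even in mixed input ("yeah but wait").
    if any(t in single_words for t in tokens):
        return True
    if any(f" {p} " in f" {text} " for p in phrases):
        return True
    # Pure backchannel: every token is an acknowledgement.
    if tokens and all(t in DEFAULT_BACKCHANNEL_WORDS for t in tokens):
        return False
    return True  # safe default: unknown content interrupts
```

Note the asymmetry: interrupt words match if present anywhere, while backchannel suppression requires the *entire* utterance to be acknowledgements.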
### Configuration

```python
from livekit.agents.voice import InterruptionFilterConfig

# Custom configuration
config = InterruptionFilterConfig(
    backchannel_words={"yeah", "ok", "mhmm"},
    interrupt_words={"stop", "wait"},
    enabled=True
)
```
## Files Changed

| File | Description |
|------|-------------|
| `interruption_filter.py` | NEW - Core filter logic |
| `agent_activity.py` | State tracking and filter integration |
| `audio_recognition.py` | Skip EOU for filtered utterances |
| `agent_session.py` | Configuration options |
| `__init__.py` | Public API exports |
## Testing

```bash
# Run unit tests from the project root
python -m pytest livekit-agents/tests/test_interruption_filter.py -v
```
## Proof of Functionality

See the `proof/` folder for:

- `transcript.txt` - Annotated conversation transcript
- Screen recording demonstrating the feature
## Key Design Decisions

1. **State Capture Timing**: Agent state is captured when the user starts speaking, not when the filter runs. This handles race conditions between VAD and STT.

2. **EOU Detection Skip**: When a backchannel is filtered, End-of-Utterance (EOU) detection is also skipped to prevent new response generation.

3. **Flexible Matching**: Hyphenated words like "uh-huh" match "uh-huh", "uh huh", and "uhhuh" to handle STT variations.

4. **Safe Default**: Unknown words while the agent is speaking trigger an interruption, since they could be important.
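The flexible matching in decision 3 can be sketched as a small normalization helper that expands each configured word into the spellings STT engines tend to produce (illustrative; the real module may use a different strategy):

```python
def hyphen_variants(word: str) -> set[str]:
    """Expand a hyphenated word into its common STT spellings."""
    if "-" not in word:
        return {word}
    # "uh-huh" -> {"uh-huh", "uh huh", "uhhuh"}
    return {word, word.replace("-", " "), word.replace("-", "")}


def expand_word_list(words: set[str]) -> set[str]:
    """Expand every configured word so lookups match any STT variant."""
    expanded: set[str] = set()
    for w in words:
        expanded |= hyphen_variants(w)
    return expanded
```

Expanding the word lists once at configuration time keeps the per-utterance check a plain set lookup.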
## Demo Agent (`examples/voice_agents/interrupt_demo.py`)

```python
# Intelligent Interruption Handling Demo Agent
#
# This example demonstrates the intelligent interruption handling feature that
# distinguishes between passive acknowledgements (backchanneling) and actual
# interruptions during voice conversations.
#
# When the agent is speaking:
#   - "yeah", "ok", "hmm"  -> Agent continues speaking (backchannel)
#   - "stop", "wait", "no" -> Agent stops immediately (interruption)
#   - "yeah but wait"      -> Agent stops (mixed input with interrupt word)
#
# When the agent is silent:
#   - All user input is processed normally, including backchannel words
#
# Prerequisites:
#   1. Set environment variables:
#      - LIVEKIT_URL (e.g., wss://your-project.livekit.cloud)
#      - LIVEKIT_API_KEY
#      - LIVEKIT_API_SECRET
#      - DEEPGRAM_API_KEY (or other STT provider)
#      - OPENAI_API_KEY (or other LLM provider)
#   2. Get a free LiveKit Cloud account at: https://cloud.livekit.io
#   3. Get a free Deepgram account at: https://console.deepgram.com
#
# Running the demo:
#   python interrupt_demo.py dev
#
# Then connect via LiveKit Agents Playground:
#   https://agents-playground.livekit.io/

import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    RunContext,
    cli,
    room_io,
)
from livekit.agents.llm import function_tool
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("interrupt-demo")

load_dotenv()


class DemoAgent(Agent):
    """Demo agent for testing intelligent interruption handling.

    This agent tells long stories when asked, making it easy to test
    whether backchanneling words trigger interruptions.
    """

    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a friendly storyteller named Alex.

Your job is to help demonstrate the intelligent interruption handling feature.
When asked for a story, tell a LONG, engaging story (at least 3-4 paragraphs).

When the user says things like "yeah", "ok", "uh-huh", or "hmm" while you're
speaking, these are just acknowledgements - keep talking!

But if they say "stop", "wait", "hold on", or "actually" - stop and listen.

Keep your responses conversational but long enough to test interruptions.
Do not use emojis or special characters. Speak naturally.""",
        )

    async def on_enter(self):
        """Greet the user when the session starts."""
        self.session.generate_reply(
            instructions="Greet the user and tell them you're here to tell stories. "
            "Ask if they'd like to hear a story. Keep it brief - just 1-2 sentences."
        )

    @function_tool
    async def tell_story(self, context: RunContext, topic: str = "adventure"):
        """Tell a story about a given topic.

        Args:
            topic: The topic or theme for the story
        """
        logger.info(f"Telling a story about: {topic}")
        return f"Tell a long, engaging story about {topic}. Make it at least 3-4 paragraphs."


server = AgentServer()


def prewarm(proc: JobProcess):
    """Prewarm the VAD model for faster startup."""
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def entrypoint(ctx: JobContext):
    """Entry point for the voice agent session."""
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    # Create session with intelligent interruption handling enabled
    session = AgentSession(
        # Speech-to-text - Deepgram Nova 3 provides fast, accurate transcription
        stt="deepgram/nova-3",
        # LLM - GPT-4.1-mini is fast and capable for storytelling
        llm="openai/gpt-4.1-mini",
        # Text-to-speech - Cartesia Sonic 2 for natural speech
        tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
        # Turn detection
        turn_detection=MultilingualModel(),
        vad=ctx.proc.userdata["vad"],
        # Enable preemptive generation for lower latency
        preemptive_generation=True,
        # Enable false interruption resumption
        resume_false_interruption=True,
        # =================================================================
        # INTELLIGENT INTERRUPTION HANDLING - The feature being demonstrated
        # =================================================================
        # Enable the interruption filter (default: True)
        interruption_filter_enabled=True,
        # Optional: Custom backchannel words to ignore when agent is speaking
        # Uncomment to customize:
        # backchannel_words={
        #     "yeah", "yes", "yep", "ok", "okay",
        #     "hmm", "mhm", "uh-huh", "right", "sure",
        # },
        # Optional: Custom words that always trigger interruption
        # Uncomment to customize:
        # interrupt_words={
        #     "stop", "wait", "hold on", "pause", "no",
        #     "actually", "but", "however",
        # },
    )

    logger.info("Starting session with intelligent interruption handling enabled")

    await session.start(
        agent=DemoAgent(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(),
        ),
    )


if __name__ == "__main__":
    cli.run_app(server)
```