50 changes: 50 additions & 0 deletions COMMIT_MESSAGE.txt
@@ -0,0 +1,50 @@
feat: implement intelligent interruption handling with configurable fuzzy matching

Add context-aware interruption detection to distinguish between backchannel
responses ("yeah", "okay", "hmm") and genuine interruptions ("stop", "wait").
This enables natural conversation flow where users can acknowledge they're
listening without disrupting the agent.

Key Features:
- Configurable fuzzy string matching using rapidfuzz (default 80% threshold)
- Handles STT typos and variations automatically ("yeahh" β†’ "yeah" @ 88%)
- Sub-millisecond performance with process.extractOne optimization
- State-aware: only filters interruptions when agent is speaking
- Robust error handling with safe fallback behavior
- 16 default backchannel words (configurable via param or env var)
- Comprehensive debug logging for production troubleshooting

Technical Implementation:
- agent_activity.py: Add _is_soft_input() and _should_ignore_interruption()
with fuzzy matching, error handling, and performance optimizations
- agent_session.py: Add DEFAULT_IGNORED_WORDS, fuzzy_match_threshold param,
and environment variable support (LIVEKIT_AGENT_IGNORED_WORDS)
- Chose fuzzy matching over semantic embeddings due to latency (<1ms vs 50-200ms)

Testing & Documentation:
- 24 comprehensive tests covering exact/fuzzy matching, edge cases, thresholds
- Demo application with usage examples and configuration display
- Complete technical specification in PLAN.md with 8-minute video script
- Interactive demonstration_walkthrough.py script with mock scenarios
- Enhanced README.md with detailed feature description
- PR_MESSAGE.md with comprehensive implementation details
- Token generation utility (generate_token.py) for LiveKit playground

Behavior Matrix:
- "yeah/okay/hmm" while speaking β†’ agent continues (backchannel)
- "yeahh/okayy" while speaking β†’ agent continues (fuzzy match)
- "wait/stop/no" while speaking β†’ agent stops (real interruption)
- "yeah but wait" while speaking β†’ agent stops (mixed input)
- Any input when silent β†’ processed normally

Configuration:
- Default: fuzzy_match_threshold=80 (balanced)
- Lenient: fuzzy_match_threshold=70 (noisy/accents)
- Strict: fuzzy_match_threshold=90 (formal/clear audio)
- Exact: fuzzy_match_threshold=100 (testing/debugging)

Breaking Changes: None (backward compatible)

Dependencies: Added rapidfuzz>=3.0.0 for fuzzy string matching

Closes: Intelligent interruption handling implementation
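The fuzzy scores the commit message cites (e.g. "yeahh" β†’ "yeah" @ 88%) follow from the normalized similarity formula `100 Β· 2M / (len(a) + len(b))`, where M is the number of matched characters. A stdlib sketch with `difflib.SequenceMatcher` (whose matching-blocks ratio agrees with rapidfuzz's `fuzz.ratio` on these inputs, though the real code uses rapidfuzz) illustrates why "yeahh" clears the 80% cutoff while "wait" does not:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # 100 * 2M / (len(a) + len(b)); difflib used here as a stdlib
    # stand-in for rapidfuzz's fuzz.ratio, which the feature actually uses.
    return SequenceMatcher(None, a, b).ratio() * 100

print(round(ratio("yeahh", "yeah"), 1))  # -> 88.9 (passes the default 80 cutoff)
print(round(ratio("wait", "yeah"), 1))   # -> 25.0 (well below the cutoff)
```

This is why an STT transcript with a trailing repeated letter still counts as a backchannel word, while a genuinely different word like "wait" triggers a real interruption.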
19 changes: 16 additions & 3 deletions README.md
@@ -38,6 +38,12 @@ agents that can see, hear, and understand.
- **Telephony integration**: Works seamlessly with LiveKit's [telephony stack](https://docs.livekit.io/sip/), allowing your agent to make calls to or receive calls from phones.
- **Exchange data with clients**: Use [RPCs](https://docs.livekit.io/home/client/data/rpc/) and other [Data APIs](https://docs.livekit.io/home/client/data/) to seamlessly exchange data with clients.
- **Semantic turn detection**: Uses a transformer model to detect when a user is done with their turn, helping to reduce interruptions.
- **Intelligent interruption handling**: Context-aware filtering with configurable fuzzy matching (default 80% similarity, customizable 0-100) distinguishes passive acknowledgements ("yeah", "ok", "hmm") from intentional interruptions ("stop", "wait"), so the agent keeps talking when users are merely signalling that they're listening. Features include:
- Handles typos and STT variations automatically ("yeahh" β†’ "yeah")
- Configurable similarity threshold for different use cases
- Robust error handling with automatic fallback
- Performance-optimized fuzzy matching
- Comprehensive test coverage (24 tests)
- **MCP support**: Native support for MCP. Integrate tools provided by MCP servers with one line of code.
- **Builtin test framework**: Write tests and use judges to ensure your agent is performing as expected.
- **Open-source**: Fully open-source, allowing you to run the entire stack on your own servers, including [LiveKit server](https://github.com/livekit/livekit), one of the most widely used WebRTC media servers.
@@ -277,16 +283,23 @@ async def test_no_availability() -> None:
</p>
</td>
<td width="50%">
<h3>πŸ’¬ Text-only agent</h3>
<p>Skip voice altogether and use the same code for text-only integrations</p>
<h3>🎀 Intelligent interruption handling</h3>
<p>Agent that ignores backchannel words ("yeah", "ok") while speaking but responds to commands ("stop", "wait")</p>
<p>
<a href="examples/other/text_only.py">Code</a>
<a href="examples/voice_agents/intelligent_interruption_demo.py">Code</a>
</p>
</td>
</tr>

<tr>
<td width="50%">
<h3>πŸ’¬ Text-only agent</h3>
<p>Skip voice altogether and use the same code for text-only integrations</p>
<p>
<a href="examples/other/text_only.py">Code</a>
</p>
</td>
<td width="50%">
<h3>πŸ“ Multi-user transcriber</h3>
<p>Produce transcriptions from all users in the room</p>
<p>
111 changes: 111 additions & 0 deletions examples/voice_agents/intelligent_interruption_demo.py
@@ -0,0 +1,111 @@
"""
Intelligent Interruption Handling Demo
======================================

This example demonstrates the intelligent interruption handling feature that
distinguishes between passive acknowledgements ("yeah", "ok", "hmm") and
intentional interruptions ("stop", "wait", "no").

Key behaviors:
1. When the agent is speaking and the user says "yeah/ok/hmm" -> Agent continues uninterrupted
2. When the agent is speaking and the user says "stop/wait/no" -> Agent stops immediately
3. When the agent is silent and the user says "yeah" -> Agent responds normally

Features:
- Configurable fuzzy matching threshold (default 80%, range 0-100)
- Handles typos and misspellings ("yeahh", "okayy", "yea")
- STT transcription variations ("yah" vs "yeah")
- Common phonetic variations
- Robust error handling for fuzzy matching failures
- Case-insensitive matching
- Punctuation handling

Configuration:
# Default threshold (80%)
session = AgentSession(...)

# Stricter matching (requires closer matches)
session = AgentSession(fuzzy_match_threshold=90, ...)

# More lenient matching (allows more variations)
session = AgentSession(fuzzy_match_threshold=70, ...)

Usage:
uv run examples/voice_agents/intelligent_interruption_demo.py console

# Text mode (no microphone required)
uv run examples/voice_agents/intelligent_interruption_demo.py console --text

Environment variables:
LIVEKIT_URL - Your LiveKit server URL
LIVEKIT_API_KEY - Your LiveKit API key
LIVEKIT_API_SECRET - Your LiveKit API secret
OPENAI_API_KEY - Your OpenAI API key (for LLM and TTS)

# Optional: Customize ignored words (comma-separated)
LIVEKIT_AGENT_IGNORED_WORDS - e.g., "yeah,ok,hmm,right,uh-huh"
"""

from pathlib import Path

from dotenv import load_dotenv

# Load .env from examples directory
env_path = Path(__file__).parent.parent / ".env"
load_dotenv(dotenv_path=env_path)

from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
await ctx.connect()

# Create an agent with a long explanation prompt to test interruption handling
agent = Agent(
instructions="""You are a helpful voice assistant demonstrating intelligent interruption handling.

When asked to explain something, give a LONG, detailed explanation (at least 30 seconds of speech).
This helps demonstrate that you can continue speaking even when the user says "yeah", "ok", or "hmm"
to acknowledge they're listening.

Example topics you can explain in detail:
- The history of the internet
- How airplanes fly
- The water cycle
- Photosynthesis
- How computers work

When the user says things like "stop", "wait", "hold on", or "no", you should stop immediately
and listen to what they have to say.

Start by greeting the user and offering to explain a topic in detail.""",
)

# Create the agent session with intelligent interruption handling
# The ignored_words list can be customized here or via environment variable
# The fuzzy_match_threshold can be adjusted (default 80, range 0-100)
session = AgentSession(
vad=silero.VAD.load(),
stt=deepgram.STT(),
llm=openai.LLM(),
tts=openai.TTS(),
# Customize the ignored words list if needed (uses defaults if not specified)
# ignored_words=["yeah", "ok", "hmm", "right", "uh-huh", "mhm", "sure"],
# Customize fuzzy matching threshold (default 80)
# fuzzy_match_threshold=90, # Stricter: requires closer matches
# fuzzy_match_threshold=70, # More lenient: allows more variations
)

# Log the current configuration
print(f"\n🎀 Ignored words (backchannel): {list(session.options.ignored_words)}")
print(f"πŸ“Š Fuzzy match threshold: {session.options.fuzzy_match_threshold}%")
print(" These words will NOT interrupt the agent while it's speaking.")
print("\nπŸ’‘ Try saying 'yeah' or 'ok' while the agent is talking - it will continue!")
print(" But saying 'stop' or 'wait' will interrupt immediately.\n")

await session.start(agent=agent, room=ctx.room)


if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
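The `LIVEKIT_AGENT_IGNORED_WORDS` variable mentioned in the docstring is comma-separated. The actual parsing lives in `agent_session.py` (not shown in this diff), but a plausible sketch of it β€” `ignored_words_from_env` and the three-word default list are illustrative, not the real names or the full 16-word default β€” might look like:

```python
import os

DEFAULT_IGNORED_WORDS = ["yeah", "ok", "hmm"]  # illustrative subset of the defaults

def ignored_words_from_env() -> list[str]:
    # Comma-separated override, e.g. LIVEKIT_AGENT_IGNORED_WORDS="yeah,ok,uh-huh";
    # entries are trimmed and lowercased, empty entries dropped.
    raw = os.environ.get("LIVEKIT_AGENT_IGNORED_WORDS", "")
    words = [w.strip().lower() for w in raw.split(",") if w.strip()]
    return words or DEFAULT_IGNORED_WORDS

os.environ["LIVEKIT_AGENT_IGNORED_WORDS"] = "Yeah, OK , uh-huh"
print(ignored_words_from_env())  # -> ['yeah', 'ok', 'uh-huh']
```

Trimming and lowercasing keeps the env-var override consistent with the case-insensitive matching described above.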
122 changes: 116 additions & 6 deletions livekit-agents/livekit/agents/voice/agent_activity.py
@@ -9,6 +9,9 @@
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, Optional, Union, cast

import re
from rapidfuzz import fuzz, process

from opentelemetry import context as otel_context, trace

from livekit import rtc
@@ -236,6 +239,98 @@ def _validate_turn_detection(

return mode

def _is_soft_input(self, text: str) -> bool:
"""
Check if the given text consists only of ignored/backchannel words.

Uses fuzzy matching with a similarity threshold of 80% to handle:
- Typos and misspellings ("yeahh", "okayy", "yea")
- STT transcription variations ("yah" vs "yeah")
- Common phonetic variations

Returns True if all words in the text match ignored_words (exactly or fuzzily),
meaning this is likely a passive acknowledgement rather than an
intentional interruption.
"""
if not text:
return True # Empty text is considered soft

ignored_words = set(w.lower() for w in self._session.options.ignored_words)
if not ignored_words:
return False # No ignored words configured, nothing is soft

# Normalize and extract words from the transcript
normalized = text.lower().strip()
# Split on whitespace and punctuation, keep only alphanumeric words
words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)?", normalized)

if not words:
return True # No actual words found

# Use configurable fuzzy matching threshold from session options
SIMILARITY_THRESHOLD = self._session.options.fuzzy_match_threshold

# Check if ALL words are in the ignored list (exact or fuzzy match)
for word in words:
# Try exact match first (faster)
if word in ignored_words:
continue

# Try fuzzy match with threshold using extractOne for efficiency
try:
best_match = process.extractOne(
word,
ignored_words,
scorer=fuzz.ratio,
score_cutoff=SIMILARITY_THRESHOLD
)

if not best_match:
return False # Found a word that doesn't match any ignored word
except Exception as e:
# If fuzzy matching fails, log error and fall back to allowing interruption
logger.error(
"fuzzy matching failed, allowing interruption",
exc_info=e,
extra={"word": word, "transcript": text}
)
return False # Safely allow interruption on error

logger.debug(
"soft input detected, ignoring interruption",
extra={"transcript": text, "words": words}
)
return True # All words matched (exactly or fuzzily)
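Stripped of the framework, the matching loop in `_is_soft_input` can be sketched with stdlib tools β€” `difflib` standing in for rapidfuzz, and `IGNORED` an illustrative subset of the default word list:

```python
import re
from difflib import SequenceMatcher

IGNORED = {"yeah", "ok", "okay", "hmm", "uh-huh"}  # illustrative subset

def is_soft_input(text: str, threshold: float = 80.0) -> bool:
    # Same shape as _is_soft_input: extract words, then require every
    # word to match an ignored word exactly or above the threshold.
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)?", text.lower().strip())
    if not words:
        return True  # no actual words -> treat as soft
    def score(word: str) -> float:
        return max(SequenceMatcher(None, word, w).ratio() * 100 for w in IGNORED)
    return all(w in IGNORED or score(w) >= threshold for w in words)

print(is_soft_input("Yeahh, okay."))   # -> True  (fuzzy + exact match)
print(is_soft_input("yeah but wait"))  # -> False (mixed input interrupts)
```

The mixed-input case is the important one: a single non-backchannel word makes the whole utterance a real interruption, matching the "yeah but wait" row of the behavior matrix.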

def _should_ignore_interruption(self, transcript: str | None = None) -> bool:
"""
Determine if an interruption should be ignored based on agent state and transcript.

Returns True if:
- Agent is currently speaking (has active, uninterrupted speech)
- The transcript (if available) consists only of backchannel words
"""
# Agent must be speaking and have active, non-interrupted speech
if (
self._current_speech is None
or self._current_speech.interrupted
or not self._current_speech.allow_interruptions
):
return False # Agent is not speaking or already interrupted

# If we have a transcript, check if it's soft input
if transcript is not None:
return self._is_soft_input(transcript)

# If STT is available, check the current transcript from audio recognition
if self._audio_recognition is not None:
current = self._audio_recognition.current_transcript
if current:
return self._is_soft_input(current)

# No transcript available - cannot determine, allow interruption
return False
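The speaking-state gate at the top of `_should_ignore_interruption` can be exercised in isolation with a minimal stand-in for the speech handle (the `Speech` dataclass here is hypothetical, not the real `SpeechHandle`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Speech:  # minimal stand-in for the real SpeechHandle
    interrupted: bool = False
    allow_interruptions: bool = True

def backchannel_filter_active(current_speech: Optional[Speech]) -> bool:
    # Mirrors the guard above: the filter only applies while active,
    # uninterrupted, interruptible speech is in progress.
    return (
        current_speech is not None
        and not current_speech.interrupted
        and current_speech.allow_interruptions
    )

print(backchannel_filter_active(None))                      # -> False (agent silent)
print(backchannel_filter_active(Speech()))                  # -> True  (agent speaking)
print(backchannel_filter_active(Speech(interrupted=True)))  # -> False (already cut off)
```

When the gate is inactive, every utterance is processed normally β€” which is why "yeah" gets a normal response when the agent is silent.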

@property
def scheduling_paused(self) -> bool:
return self._scheduling_paused
@@ -1166,14 +1261,23 @@ def _on_generation_created(self, ev: llm.GenerationCreatedEvent) -> None:
)
self._schedule_speech(handle, SpeechHandle.SPEECH_PRIORITY_NORMAL)

def _interrupt_by_audio_activity(self) -> None:
def _interrupt_by_audio_activity(self, *, transcript: str | None = None) -> None:
opt = self._session.options
use_pause = opt.resume_false_interruption and opt.false_interruption_timeout is not None

if isinstance(self.llm, llm.RealtimeModel) and self.llm.capabilities.turn_detection:
# ignore if realtime model has turn detection enabled
return

# Check for soft/backchannel input - if the agent is speaking and the
# user only said ignored words, skip the interruption entirely
if self._should_ignore_interruption(transcript):
logger.debug(
"ignoring soft input while agent is speaking",
extra={"transcript": transcript or (self._audio_recognition.current_transcript if self._audio_recognition else "")},
)
return

if (
self.stt is not None
and opt.min_interruption_words > 0
@@ -1248,20 +1352,23 @@ def on_interim_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None) -
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript_text = ev.alternatives[0].text

self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript_text,
is_final=False,
speaker_id=ev.alternatives[0].speaker_id,
),
)

if ev.alternatives[0].text and self._turn_detection not in (
if transcript_text and self._turn_detection not in (
"manual",
"realtime_llm",
):
self._interrupt_by_audio_activity()
# Pass the transcript to enable soft input detection
self._interrupt_by_audio_activity(transcript=transcript_text)

if (
speaking is False
@@ -1276,10 +1383,12 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript_text = ev.alternatives[0].text

self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript_text,
is_final=True,
speaker_id=ev.alternatives[0].speaker_id,
),
@@ -1292,7 +1401,8 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No
"manual",
"realtime_llm",
):
self._interrupt_by_audio_activity()
# Pass the transcript to enable soft input detection
self._interrupt_by_audio_activity(transcript=transcript_text)

if (
speaking is False