50 changes: 50 additions & 0 deletions COMMIT_MESSAGE.txt
@@ -0,0 +1,50 @@
feat: implement intelligent interruption handling with configurable fuzzy matching

Add context-aware interruption detection to distinguish between backchannel
responses ("yeah", "okay", "hmm") and genuine interruptions ("stop", "wait").
This enables natural conversation flow where users can acknowledge they're
listening without disrupting the agent.

Key Features:
- Configurable fuzzy string matching using rapidfuzz (default 80% threshold)
- Handles STT typos and variations automatically ("yeahh" β†’ "yeah" @ 88%)
- Sub-millisecond performance with process.extractOne optimization
- State-aware: only filters interruptions when agent is speaking
- Robust error handling with safe fallback behavior
- 16 default backchannel words (configurable via param or env var)
- Comprehensive debug logging for production troubleshooting

Technical Implementation:
- agent_activity.py: Add _is_soft_input() and _should_ignore_interruption()
with fuzzy matching, error handling, and performance optimizations
- agent_session.py: Add DEFAULT_IGNORED_WORDS, fuzzy_match_threshold param,
and environment variable support (LIVEKIT_AGENT_IGNORED_WORDS)
- Chose fuzzy matching over semantic embeddings due to latency (<1ms vs 50-200ms)

Testing & Documentation:
- 24 comprehensive tests covering exact/fuzzy matching, edge cases, thresholds
- Demo application with usage examples and configuration display
- Complete technical specification in PLAN.md with 8-minute video script
- Interactive demonstration_walkthrough.py script with mock scenarios
- Enhanced README.md with detailed feature description
- PR_MESSAGE.md with comprehensive implementation details
- Token generation utility (generate_token.py) for LiveKit playground

Behavior Matrix:
- "yeah/okay/hmm" while speaking β†’ agent continues (backchannel)
- "yeahh/okayy" while speaking β†’ agent continues (fuzzy match)
- "wait/stop/no" while speaking β†’ agent stops (real interruption)
- "yeah but wait" while speaking β†’ agent stops (mixed input)
- Any input when silent β†’ processed normally

Configuration:
- Default: fuzzy_match_threshold=80 (balanced)
- Lenient: fuzzy_match_threshold=70 (noisy/accents)
- Strict: fuzzy_match_threshold=90 (formal/clear audio)
- Exact: fuzzy_match_threshold=100 (testing/debugging)

Breaking Changes: None (backward compatible)

Dependencies: Added rapidfuzz>=3.0.0 for fuzzy string matching

Closes: Intelligent interruption handling implementation
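The fuzzy scores the commit message cites (e.g. "yeahh" β†’ "yeah" @ 88%) follow from the normalized similarity formula `100 Β· 2M / (len(a) + len(b))`, where M is the number of matched characters. A stdlib sketch with `difflib.SequenceMatcher` (whose matching-blocks ratio agrees with rapidfuzz's `fuzz.ratio` on these inputs, though the real code uses rapidfuzz) illustrates why "yeahh" clears the 80% cutoff while "wait" does not:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # 100 * 2M / (len(a) + len(b)); difflib used here as a stdlib
    # stand-in for rapidfuzz's fuzz.ratio, which the feature actually uses.
    return SequenceMatcher(None, a, b).ratio() * 100

print(round(ratio("yeahh", "yeah"), 1))  # -> 88.9 (passes the default 80 cutoff)
print(round(ratio("wait", "yeah"), 1))   # -> 25.0 (well below the cutoff)
```

This is why an STT transcript with a trailing repeated letter still counts as a backchannel word, while a genuinely different word like "wait" triggers a real interruption.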
19 changes: 16 additions & 3 deletions README.md
@@ -38,6 +38,12 @@ agents that can see, hear, and understand.
- **Telephony integration**: Works seamlessly with LiveKit's [telephony stack](https://docs.livekit.io/sip/), allowing your agent to make calls to or receive calls from phones.
- **Exchange data with clients**: Use [RPCs](https://docs.livekit.io/home/client/data/rpc/) and other [Data APIs](https://docs.livekit.io/home/client/data/) to seamlessly exchange data with clients.
- **Semantic turn detection**: Uses a transformer model to detect when a user is done with their turn, helping to reduce interruptions.
- **Intelligent interruption handling**: Context-aware filtering with configurable fuzzy matching (default 80% similarity, customizable 0-100) distinguishes passive acknowledgements ("yeah", "ok", "hmm") from intentional interruptions ("stop", "wait"), so the agent keeps talking when users are merely signalling that they're listening. Features include:
- Handles typos and STT variations automatically ("yeahh" β†’ "yeah")
- Configurable similarity threshold for different use cases
- Robust error handling with automatic fallback
- Performance-optimized fuzzy matching
- Comprehensive test coverage (24 tests)
- **MCP support**: Native support for MCP. Integrate tools provided by MCP servers with one line of code.
- **Builtin test framework**: Write tests and use judges to ensure your agent is performing as expected.
- **Open-source**: Fully open-source, allowing you to run the entire stack on your own servers, including [LiveKit server](https://github.com/livekit/livekit), one of the most widely used WebRTC media servers.
@@ -277,16 +283,23 @@ async def test_no_availability() -> None:
</p>
</td>
<td width="50%">
<h3>πŸ’¬ Text-only agent</h3>
<p>Skip voice altogether and use the same code for text-only integrations</p>
<h3>🎀 Intelligent interruption handling</h3>
<p>Agent that ignores backchannel words ("yeah", "ok") while speaking but responds to commands ("stop", "wait")</p>
<p>
<a href="examples/other/text_only.py">Code</a>
<a href="examples/voice_agents/intelligent_interruption_demo.py">Code</a>
</p>
</td>
</tr>

<tr>
<td width="50%">
<h3>πŸ’¬ Text-only agent</h3>
<p>Skip voice altogether and use the same code for text-only integrations</p>
<p>
<a href="examples/other/text_only.py">Code</a>
</p>
</td>
<td width="50%">
<h3>πŸ“ Multi-user transcriber</h3>
<p>Produce transcriptions from all users in the room</p>
<p>
111 changes: 111 additions & 0 deletions examples/voice_agents/intelligent_interruption_demo.py
@@ -0,0 +1,111 @@
"""
Intelligent Interruption Handling Demo
======================================

This example demonstrates the intelligent interruption handling feature that
distinguishes between passive acknowledgements ("yeah", "ok", "hmm") and
intentional interruptions ("stop", "wait", "no").

Key behaviors:
1. When the agent is speaking and the user says "yeah/ok/hmm" -> Agent continues uninterrupted
2. When the agent is speaking and the user says "stop/wait/no" -> Agent stops immediately
3. When the agent is silent and the user says "yeah" -> Agent responds normally

Features:
- Configurable fuzzy matching threshold (default 80%, range 0-100)
- Handles typos and misspellings ("yeahh", "okayy", "yea")
- STT transcription variations ("yah" vs "yeah")
- Common phonetic variations
- Robust error handling for fuzzy matching failures
- Case-insensitive matching
- Punctuation handling

Configuration:
# Default threshold (80%)
session = AgentSession(...)

# Stricter matching (requires closer matches)
session = AgentSession(fuzzy_match_threshold=90, ...)

# More lenient matching (allows more variations)
session = AgentSession(fuzzy_match_threshold=70, ...)

Usage:
uv run examples/voice_agents/intelligent_interruption_demo.py console

# Text mode (no microphone required)
uv run examples/voice_agents/intelligent_interruption_demo.py console --text

Environment variables:
LIVEKIT_URL - Your LiveKit server URL
LIVEKIT_API_KEY - Your LiveKit API key
LIVEKIT_API_SECRET - Your LiveKit API secret
OPENAI_API_KEY - Your OpenAI API key (for LLM and TTS)

# Optional: Customize ignored words (comma-separated)
LIVEKIT_AGENT_IGNORED_WORDS - e.g., "yeah,ok,hmm,right,uh-huh"
"""

from pathlib import Path

from dotenv import load_dotenv

# Load .env from examples directory
env_path = Path(__file__).parent.parent / ".env"
load_dotenv(dotenv_path=env_path)

from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
await ctx.connect()

# Create an agent with a long explanation prompt to test interruption handling
agent = Agent(
instructions="""You are a helpful voice assistant demonstrating intelligent interruption handling.

When asked to explain something, give a LONG, detailed explanation (at least 30 seconds of speech).
This helps demonstrate that you can continue speaking even when the user says "yeah", "ok", or "hmm"
to acknowledge they're listening.

Example topics you can explain in detail:
- The history of the internet
- How airplanes fly
- The water cycle
- Photosynthesis
- How computers work

When the user says things like "stop", "wait", "hold on", or "no", you should stop immediately
and listen to what they have to say.

Start by greeting the user and offering to explain a topic in detail.""",
)

# Create the agent session with intelligent interruption handling
# The ignored_words list can be customized here or via environment variable
# The fuzzy_match_threshold can be adjusted (default 80, range 0-100)
session = AgentSession(
vad=silero.VAD.load(),
stt=deepgram.STT(),
llm=openai.LLM(),
tts=openai.TTS(),
# Customize the ignored words list if needed (uses defaults if not specified)
# ignored_words=["yeah", "ok", "hmm", "right", "uh-huh", "mhm", "sure"],
# Customize fuzzy matching threshold (default 80)
# fuzzy_match_threshold=90, # Stricter: requires closer matches
# fuzzy_match_threshold=70, # More lenient: allows more variations
)

# Log the current configuration
print(f"\n🎀 Ignored words (backchannel): {list(session.options.ignored_words)}")
print(f"πŸ“Š Fuzzy match threshold: {session.options.fuzzy_match_threshold}%")
print(" These words will NOT interrupt the agent while it's speaking.")
print("\nπŸ’‘ Try saying 'yeah' or 'ok' while the agent is talking - it will continue!")
print(" But saying 'stop' or 'wait' will interrupt immediately.\n")

await session.start(agent=agent, room=ctx.room)


if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
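The `LIVEKIT_AGENT_IGNORED_WORDS` variable mentioned in the docstring is comma-separated. The actual parsing lives in `agent_session.py` (not shown in this diff), but a plausible sketch of it β€” `ignored_words_from_env` and the three-word default list are illustrative, not the real names or the full 16-word default β€” might look like:

```python
import os

DEFAULT_IGNORED_WORDS = ["yeah", "ok", "hmm"]  # illustrative subset of the defaults

def ignored_words_from_env() -> list[str]:
    # Comma-separated override, e.g. LIVEKIT_AGENT_IGNORED_WORDS="yeah,ok,uh-huh";
    # entries are trimmed and lowercased, empty entries dropped.
    raw = os.environ.get("LIVEKIT_AGENT_IGNORED_WORDS", "")
    words = [w.strip().lower() for w in raw.split(",") if w.strip()]
    return words or DEFAULT_IGNORED_WORDS

os.environ["LIVEKIT_AGENT_IGNORED_WORDS"] = "Yeah, OK , uh-huh"
print(ignored_words_from_env())  # -> ['yeah', 'ok', 'uh-huh']
```

Trimming and lowercasing keeps the env-var override consistent with the case-insensitive matching described above.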
122 changes: 116 additions & 6 deletions livekit-agents/livekit/agents/voice/agent_activity.py
@@ -9,6 +9,9 @@
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, Optional, Union, cast

import re
from rapidfuzz import fuzz, process

from opentelemetry import context as otel_context, trace

from livekit import rtc
@@ -236,6 +239,98 @@ def _validate_turn_detection(

return mode

def _is_soft_input(self, text: str) -> bool:
"""
Check if the given text consists only of ignored/backchannel words.

Uses fuzzy matching with a similarity threshold of 80% to handle:
- Typos and misspellings ("yeahh", "okayy", "yea")
- STT transcription variations ("yah" vs "yeah")
- Common phonetic variations

Returns True if all words in the text match ignored_words (exactly or fuzzily),
meaning this is likely a passive acknowledgement rather than an
intentional interruption.
"""
if not text:
return True # Empty text is considered soft

ignored_words = set(w.lower() for w in self._session.options.ignored_words)
if not ignored_words:
return False # No ignored words configured, nothing is soft

# Normalize and extract words from the transcript
normalized = text.lower().strip()
# Split on whitespace and punctuation, keep only alphanumeric words
words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)?", normalized)

if not words:
return True # No actual words found

# Use configurable fuzzy matching threshold from session options
SIMILARITY_THRESHOLD = self._session.options.fuzzy_match_threshold

# Check if ALL words are in the ignored list (exact or fuzzy match)
for word in words:
# Try exact match first (faster)
if word in ignored_words:
continue

# Try fuzzy match with threshold using extractOne for efficiency
try:
best_match = process.extractOne(
word,
ignored_words,
scorer=fuzz.ratio,
score_cutoff=SIMILARITY_THRESHOLD
)

if not best_match:
return False # Found a word that doesn't match any ignored word
except Exception as e:
# If fuzzy matching fails, log error and fall back to allowing interruption
logger.error(
"fuzzy matching failed, allowing interruption",
exc_info=e,
extra={"word": word, "transcript": text}
)
return False # Safely allow interruption on error

logger.debug(
"soft input detected, ignoring interruption",
extra={"transcript": text, "words": words}
)
return True # All words matched (exactly or fuzzily)
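Stripped of the framework, the matching loop in `_is_soft_input` can be sketched with stdlib tools β€” `difflib` standing in for rapidfuzz, and `IGNORED` an illustrative subset of the default word list:

```python
import re
from difflib import SequenceMatcher

IGNORED = {"yeah", "ok", "okay", "hmm", "uh-huh"}  # illustrative subset

def is_soft_input(text: str, threshold: float = 80.0) -> bool:
    # Same shape as _is_soft_input: extract words, then require every
    # word to match an ignored word exactly or above the threshold.
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)?", text.lower().strip())
    if not words:
        return True  # no actual words -> treat as soft
    def score(word: str) -> float:
        return max(SequenceMatcher(None, word, w).ratio() * 100 for w in IGNORED)
    return all(w in IGNORED or score(w) >= threshold for w in words)

print(is_soft_input("Yeahh, okay."))   # -> True  (fuzzy + exact match)
print(is_soft_input("yeah but wait"))  # -> False (mixed input interrupts)
```

The mixed-input case is the important one: a single non-backchannel word makes the whole utterance a real interruption, matching the "yeah but wait" row of the behavior matrix.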

def _should_ignore_interruption(self, transcript: str | None = None) -> bool:
"""
Determine if an interruption should be ignored based on agent state and transcript.

Returns True if:
- Agent is currently speaking (has active, uninterrupted speech)
- The transcript (if available) consists only of backchannel words
"""
# Agent must be speaking and have active, non-interrupted speech
if (
self._current_speech is None
or self._current_speech.interrupted
or not self._current_speech.allow_interruptions
):
return False # Agent is not speaking or already interrupted

# If we have a transcript, check if it's soft input
if transcript is not None:
return self._is_soft_input(transcript)

# If STT is available, check the current transcript from audio recognition
if self._audio_recognition is not None:
current = self._audio_recognition.current_transcript
if current:
return self._is_soft_input(current)

# No transcript available - cannot determine, allow interruption
return False
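The speaking-state gate at the top of `_should_ignore_interruption` can be exercised in isolation with a minimal stand-in for the speech handle (the `Speech` dataclass here is hypothetical, not the real `SpeechHandle`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Speech:  # minimal stand-in for the real SpeechHandle
    interrupted: bool = False
    allow_interruptions: bool = True

def backchannel_filter_active(current_speech: Optional[Speech]) -> bool:
    # Mirrors the guard above: the filter only applies while active,
    # uninterrupted, interruptible speech is in progress.
    return (
        current_speech is not None
        and not current_speech.interrupted
        and current_speech.allow_interruptions
    )

print(backchannel_filter_active(None))                      # -> False (agent silent)
print(backchannel_filter_active(Speech()))                  # -> True  (agent speaking)
print(backchannel_filter_active(Speech(interrupted=True)))  # -> False (already cut off)
```

When the gate is inactive, every utterance is processed normally β€” which is why "yeah" gets a normal response when the agent is silent.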

@property
def scheduling_paused(self) -> bool:
return self._scheduling_paused
@@ -1166,14 +1261,23 @@ def _on_generation_created(self, ev: llm.GenerationCreatedEvent) -> None:
)
self._schedule_speech(handle, SpeechHandle.SPEECH_PRIORITY_NORMAL)

def _interrupt_by_audio_activity(self) -> None:
def _interrupt_by_audio_activity(self, *, transcript: str | None = None) -> None:
opt = self._session.options
use_pause = opt.resume_false_interruption and opt.false_interruption_timeout is not None

if isinstance(self.llm, llm.RealtimeModel) and self.llm.capabilities.turn_detection:
# ignore if realtime model has turn detection enabled
return

# Check for soft/backchannel input - if the agent is speaking and the
# user only said ignored words, skip the interruption entirely
if self._should_ignore_interruption(transcript):
logger.debug(
"ignoring soft input while agent is speaking",
extra={"transcript": transcript or (self._audio_recognition.current_transcript if self._audio_recognition else "")},
)
return

if (
self.stt is not None
and opt.min_interruption_words > 0
@@ -1248,20 +1352,23 @@ def on_interim_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None) -
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript_text = ev.alternatives[0].text

self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript_text,
is_final=False,
speaker_id=ev.alternatives[0].speaker_id,
),
)

if ev.alternatives[0].text and self._turn_detection not in (
if transcript_text and self._turn_detection not in (
"manual",
"realtime_llm",
):
self._interrupt_by_audio_activity()
# Pass the transcript to enable soft input detection
self._interrupt_by_audio_activity(transcript=transcript_text)

if (
speaking is False
@@ -1276,10 +1383,12 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript_text = ev.alternatives[0].text

self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript_text,
is_final=True,
speaker_id=ev.alternatives[0].speaker_id,
),
@@ -1292,7 +1401,8 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No
"manual",
"realtime_llm",
):
self._interrupt_by_audio_activity()
# Pass the transcript to enable soft input detection
self._interrupt_by_audio_activity(transcript=transcript_text)

if (
speaking is False