172 changes: 172 additions & 0 deletions README.md
@@ -41,6 +41,178 @@ agents that can see, hear, and understand.
- **MCP support**: Native support for MCP. Integrate tools provided by MCP servers with one line of code.
- **Built-in test framework**: Write tests and use judges to ensure your agent is performing as expected.
- **Open-source**: Fully open-source, allowing you to run the entire stack on your own servers, including [LiveKit server](https://github.com/livekit/livekit), one of the most widely used WebRTC media servers.
- **Backchannel Filtering**: Context-aware interruption handling that distinguishes between passive acknowledgements ("yeah", "ok") and real commands.

## Backchannel Filtering

This implementation includes a **context-aware backchannel filtering** system that prevents the agent from being interrupted by passive acknowledgement words while speaking.

### How to Run the Agent

**Prerequisites:**
```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -e ./livekit-agents
pip install -r examples/voice_agents/requirements.txt
```

**Environment Variables (create `.env` file):**
```bash
LIVEKIT_URL=wss://your-livekit-server.com
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
OPENAI_API_KEY=your-openai-key
DEEPGRAM_API_KEY=your-deepgram-key
```

**Running the Agent:**
```bash
# Console mode (local testing without LiveKit server)
python examples/voice_agents/basic_agent.py console

# Development mode (connects to LiveKit server with hot reload)
python examples/voice_agents/basic_agent.py dev

# Production mode
python examples/voice_agents/basic_agent.py start
```

**Testing with LiveKit Playground:**
1. Run agent in dev mode: `python examples/voice_agents/basic_agent.py dev`
2. Open [Agents Playground](https://agents-playground.livekit.io/)
3. Connect with your LiveKit credentials
4. Test backchannel filtering by saying "ok" or "yeah" while agent speaks

---

### How the Interruption Logic Works

The system intercepts voice activity and transcript events **before** they trigger an interruption:

```
User speaks → VAD detects voice → STT provides transcript
                        │
         Is the agent currently speaking?
               /                    \
             NO                      YES
              ↓                       ↓
     Process normally       Is transcript a backchannel?
     (respond to user)          /               \
                              YES                NO
                               ↓                  ↓
                      IGNORE completely     INTERRUPT agent
                      (agent continues)     (process command)
```

**Key Implementation Points:**

1. **Transcript-First Approach**: When agent is speaking, interruptions are deferred until STT provides a transcript. This prevents VAD from triggering premature interruptions.

2. **State-Based Filtering**: Backchannel detection ONLY applies when:
- Agent has active speech (`_current_speech is not None`)
- Speech is not already interrupted
- `allow_interruptions` is enabled

3. **Command Word Override**: If transcript contains command words (stop, wait, what, etc.), it is NEVER treated as a backchannel, even if it also contains filler words.

4. **No Audio Modification**: The filter works at the logic layer - VAD and STT are unchanged.

**Code Location:** `livekit/agents/voice/agent_activity.py` → `_interrupt_by_audio_activity()`
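
The decision flow above can be sketched as a small standalone function. This is a simplified illustration for clarity, not the library's actual implementation; the word sets mirror the defaults documented in the tables further down.

```python
# Simplified sketch of the backchannel decision logic described above.
# Illustrative only -- the real implementation lives in
# livekit/agents/voice/backchannel_filter.py.

BACKCHANNEL_WORDS = frozenset({
    "yeah", "yep", "ok", "okay", "right", "alright", "sure",
    "aha", "hmm", "mhm", "uh-huh", "mm-hmm",
    "got it", "gotcha", "i see", "oh", "cool", "nice", "great",
    "uh", "um", "er", "ah",
})
COMMAND_WORDS = frozenset({
    "stop", "wait", "hold", "pause", "halt",
    "what", "why", "how", "when", "where", "who",
    "no", "not", "never",
    "actually", "instead", "however",
})

def should_interrupt(transcript: str, agent_speaking: bool) -> bool:
    """Return True if the user's utterance should interrupt the agent."""
    if not agent_speaking:
        return True  # nothing to protect; process input normally
    words = transcript.lower().strip().split()
    if not words:
        return False  # no transcript yet: defer rather than trust VAD alone
    if any(w.rstrip(".!?,") in COMMAND_WORDS for w in words):
        return True  # command words always win, even mixed with fillers
    # short utterances made up entirely of filler words are backchannels
    phrase = transcript.lower().strip().rstrip(".!?,")
    if phrase in BACKCHANNEL_WORDS or all(
        w.rstrip(".!?,") in BACKCHANNEL_WORDS for w in words
    ):
        return False
    return True

should_interrupt("yeah", agent_speaking=True)       # False: backchannel, ignored
should_interrupt("yeah wait", agent_speaking=True)  # True: contains a command word
should_interrupt("yeah", agent_speaking=False)      # True: agent idle, respond normally
```

Note how the command check runs before the filler check, which is what makes "yeah wait" interrupt even though it starts with a filler word.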

---

### How Ignore Words Are Configured

**Option 1: Use Default Configuration (Recommended)**
```python
session = AgentSession(
backchannel_filtering=True, # Enabled by default
)
```

**Option 2: Custom Ignore Words**
```python
session = AgentSession(
backchannel_filtering=True,
backchannel_words=frozenset({
'yeah', 'yep', 'ok', 'okay', 'hmm', 'mhm',
'uh huh', 'right', 'got it', 'sure'
}),
)
```

**Option 3: Disable Backchannel Filtering**
```python
session = AgentSession(
backchannel_filtering=False, # VAD triggers interrupt on any speech
)
```

**Default Ignored Words:**

| Category | Words |
|----------|-------|
| Affirmations | yeah, yep, ok, okay, right, alright, sure |
| Listening signals | aha, hmm, mhm, uh-huh, mm-hmm |
| Understanding | got it, gotcha, i see, oh, cool, nice, great |
| Hesitations | uh, um, er, ah |

**Command Words (Never Ignored):**

| Category | Words |
|----------|-------|
| Stop commands | stop, wait, hold, pause, halt |
| Questions | what, why, how, when, where, who |
| Negations | no, not, never |
| Redirection | actually, instead, however |

**Using BackchannelFilter Class Directly:**
```python
from livekit.agents.voice.backchannel_filter import BackchannelFilter

bc_filter = BackchannelFilter(
    backchannel_words=frozenset({'yeah', 'ok', 'custom'}),
    command_words=frozenset({'stop', 'wait'}),
    max_words=5,
)

bc_filter.is_backchannel("yeah")  # True
bc_filter.is_backchannel("yeah wait")  # False (contains command)
bc_filter.contains_command_words("stop")  # True
```

---

### Running Unit Tests

```bash
# Run all backchannel filter tests (79 tests)
python -m pytest tests/test_backchannel_filter.py -v

# Run specific scenario tests
python -m pytest tests/test_backchannel_filter.py -v -k "scenario"

# Run with short output
python -m pytest tests/test_backchannel_filter.py --tb=short
```

### Files Modified

| File | Changes |
|------|---------|
| `livekit/agents/voice/backchannel_filter.py` | New module with `BackchannelFilter` class |
| `livekit/agents/voice/agent_activity.py` | Interruption handling with backchannel detection |
| `livekit/agents/voice/agent_session.py` | Configuration options |
| `tests/test_backchannel_filter.py` | 79 unit tests |

## Installation

7 changes: 7 additions & 0 deletions livekit-agents/livekit/agents/ipc/supervised_proc.py
@@ -39,6 +39,13 @@ def _mask_ctrl_c() -> Generator[None, None, None]:
finally:
signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGINT])
else:
# On Windows, signal.signal() only works in the main thread
# Check if we're in the main thread before attempting to mask
if threading.current_thread() is not threading.main_thread():
# Not in main thread, skip signal masking
yield
return

old = signal.signal(signal.SIGINT, signal.SIG_IGN)
try:
yield
89 changes: 75 additions & 14 deletions livekit-agents/livekit/agents/voice/agent_activity.py
@@ -75,6 +75,7 @@
update_instructions,
)
from .speech_handle import SpeechHandle
from .backchannel_filter import is_backchannel, is_likely_continuation

if TYPE_CHECKING:
from ..llm import mcp
@@ -1166,33 +1167,49 @@ def _on_generation_created(self, ev: llm.GenerationCreatedEvent) -> None:
)
self._schedule_speech(handle, SpeechHandle.SPEECH_PRIORITY_NORMAL)

def _interrupt_by_audio_activity(self) -> None:
def _interrupt_by_audio_activity(self, *, speech_duration: float | None = None) -> None:
opt = self._session.options
use_pause = opt.resume_false_interruption and opt.false_interruption_timeout is not None

if isinstance(self.llm, llm.RealtimeModel) and self.llm.capabilities.turn_detection:
# ignore if realtime model has turn detection enabled
return

# Get current transcript for backchannel and word count checks
transcript = ""
if self._audio_recognition is not None:
transcript = self._audio_recognition.current_transcript

# Check if agent has an active (non-interrupted) speech
has_active_speech = (
self._current_speech is not None
and not self._current_speech.interrupted
and self._current_speech.allow_interruptions
)

# BACKCHANNEL FILTERING: When agent is speaking, require transcript before interrupting
if opt.backchannel_filtering and has_active_speech:
# If no transcript yet, defer - don't interrupt until we know what user said
if not transcript:
return

# Check if it's a backchannel - if so, ignore
if is_backchannel(transcript, custom_words=opt.backchannel_words):
return

if (
self.stt is not None
and opt.min_interruption_words > 0
and self._audio_recognition is not None
and transcript
):
text = self._audio_recognition.current_transcript

# TODO(long): better word splitting for multi-language
if len(split_words(text, split_character=True)) < opt.min_interruption_words:
if len(split_words(transcript, split_character=True)) < opt.min_interruption_words:
return

if self._rt_session is not None:
self._rt_session.start_user_activity()

if (
self._current_speech is not None
and not self._current_speech.interrupted
and self._current_speech.allow_interruptions
):
if has_active_speech:
self._paused_speech = self._current_speech

# reset the false interruption timer
@@ -1209,6 +1226,7 @@ def _interrupt_by_audio_activity(self) -> None:

self._current_speech.interrupt()


# region recognition hooks

def on_start_of_speech(self, ev: vad.VADEvent | None) -> None:
@@ -1241,23 +1259,50 @@ def on_vad_inference_done(self, ev: vad.VADEvent) -> None:
return

if ev.speech_duration >= self._session.options.min_interruption_duration:
self._interrupt_by_audio_activity()
self._interrupt_by_audio_activity(speech_duration=ev.speech_duration)

def _is_backchannel_during_speech(self, transcript: str) -> bool:
# Check if transcript is a backchannel while agent has active speech.
opt = self._session.options
if not opt.backchannel_filtering:
return False

has_active_speech = (
self._current_speech is not None
and not self._current_speech.interrupted
and self._current_speech.allow_interruptions
)

if not has_active_speech:
return False

return is_backchannel(transcript, custom_words=opt.backchannel_words)

def on_interim_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None) -> None:
if isinstance(self.llm, llm.RealtimeModel) and self.llm.capabilities.user_transcription:
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript = ev.alternatives[0].text

# Check if this is a backchannel while agent is speaking
is_bc = self._is_backchannel_during_speech(transcript)

# Always display the transcript on screen
self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript,
is_final=False,
speaker_id=ev.alternatives[0].speaker_id,
),
)

# Skip interruption logic for backchannels
if is_bc:
return

if ev.alternatives[0].text and self._turn_detection not in (
if transcript and self._turn_detection not in (
"manual",
"realtime_llm",
):
Expand All @@ -1276,14 +1321,25 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No
# skip stt transcription if user_transcription is enabled on the realtime model
return

transcript = ev.alternatives[0].text

# Check if this is a backchannel while agent is speaking
is_bc = self._is_backchannel_during_speech(transcript)

# Always display the transcript on screen
self._session._user_input_transcribed(
UserInputTranscribedEvent(
language=ev.alternatives[0].language,
transcript=ev.alternatives[0].text,
transcript=transcript,
is_final=True,
speaker_id=ev.alternatives[0].speaker_id,
),
)

# Skip interruption and turn completion for backchannels
if is_bc:
return

# agent speech might not be interrupted if VAD failed and a final transcript is received
# we call _interrupt_by_audio_activity (idempotent) to pause the speech, if possible
# which will also be immediately interrupted
@@ -1365,6 +1421,11 @@ def on_end_of_turn(self, info: _EndOfTurnInfo) -> bool:
# TODO(theomonnom): should we "forward" this new turn to the next agent/activity?
return True

# BACKCHANNEL FILTERING: If agent is speaking and transcript is a backchannel, ignore
if self._is_backchannel_during_speech(info.new_transcript):
self._cancel_preemptive_generation()
return False

if (
self.stt is not None
and self._turn_detection != "manual"
8 changes: 8 additions & 0 deletions livekit-agents/livekit/agents/voice/agent_session.py
@@ -89,6 +89,9 @@ class AgentSessionOptions:
preemptive_generation: bool
tts_text_transforms: Sequence[TextTransforms] | None
ivr_detection: bool
# Backchannel filtering options
backchannel_filtering: bool
backchannel_words: frozenset[str] | None


Userdata_T = TypeVar("Userdata_T")
Expand Down Expand Up @@ -159,6 +162,8 @@ def __init__(
tts_text_transforms: NotGivenOr[Sequence[TextTransforms] | None] = NOT_GIVEN,
preemptive_generation: bool = False,
ivr_detection: bool = False,
backchannel_filtering: bool = True,
backchannel_words: frozenset[str] | None = None,
conn_options: NotGivenOr[SessionConnectOptions] = NOT_GIVEN,
loop: asyncio.AbstractEventLoop | None = None,
# deprecated
@@ -288,6 +293,8 @@ def __init__(
use_tts_aligned_transcript=use_tts_aligned_transcript
if is_given(use_tts_aligned_transcript)
else None,
backchannel_filtering=backchannel_filtering,
backchannel_words=backchannel_words,
)
self._conn_options = conn_options or SessionConnectOptions()
self._started = False
@@ -490,6 +497,7 @@ async def start(
return None

self._started_at = time.time()
logger.info("Agent session started at %s", self._started_at)

# configure observability first
job_ctx: JobContext | None = None