15 changes: 15 additions & 0 deletions .dockerignore
@@ -0,0 +1,15 @@
.git
.venv
.env
__pycache__
*.pyc
*.pyo
*.egg-info
dist
build
.pytest_cache
.mypy_cache
.ruff_cache
tests/
*.md
screenshot.png
39 changes: 39 additions & 0 deletions .env.example
@@ -0,0 +1,39 @@
# Copy this file to .env and edit to suit your system.
# .env is gitignored — .env.example is the reference.

# ── GPU ───────────────────────────────────────────────────────────────────────

# Which GPU(s) ROCm should use. "0" = first GPU only (recommended for
# multi-GPU systems with unmatched cards, avoids imbalance segfaults).
# Set to "0,1" to use both GPUs if they are the same model.
HIP_VISIBLE_DEVICES=0

# Enables Flash Efficient and Memory Efficient attention on RDNA3+ GPUs (RX 7000 / 9000).
# Set to empty string to disable (PyTorch will log a warning and use the slow path).
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Set ONLY if your GPU's gfx version isn't natively supported by ROCm.
# WARNING: Do NOT set this to an empty string — an empty string is not the
# same as unset and will cause ROCm to fail. Only uncomment if needed.
# Check your GPU: rocm-smi --showproductname
# RX 6000 series (RDNA2): 10.3.0
# RX 7000 series (RDNA3): 11.0.0
# HSA_OVERRIDE_GFX_VERSION=10.3.0

# ── paths (override if your data lives elsewhere) ─────────────────────────────

# Host directory containing audio_clips/ and manifest.jsonl
# Default: ~/.listenr
# LISTENR_DATA=/home/you/.listenr

# Host directory for train/dev/test dataset splits (written by build-dataset)
# Default: ~/listenr_dataset
# LISTENR_DATASET=/home/you/listenr_dataset

# Host directory for LoRA adapter checkpoints (written by finetune)
# Default: ~/listenr_finetune
# LISTENR_FINETUNE=/home/you/listenr_finetune

# HuggingFace model cache (shared with host to avoid re-downloads)
# Default: ~/.cache/huggingface
# HF_CACHE=/home/you/.cache/huggingface
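The warning about `HSA_OVERRIDE_GFX_VERSION` rests on the difference between a variable that is unset and one that is set to an empty string. The sketch below only illustrates how a process observes those two states; `describe_env` is a hypothetical helper, not part of listenr or ROCm.

```python
import os


def describe_env(name: str, environ=os.environ) -> str:
    """Classify a variable as 'unset', 'empty', or 'set' (hypothetical helper)."""
    value = environ.get(name)
    if value is None:
        return "unset"  # the variable is absent from the environment
    if value == "":
        return "empty"  # present but blank: this is NOT the same as unset
    return "set"


if __name__ == "__main__":
    # Prints one of: unset, empty, set
    print(describe_env("HSA_OVERRIDE_GFX_VERSION"))
```

Leaving the line commented out in `.env` keeps the variable in the "unset" state, which is what the comments above recommend unless an override is actually needed.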
67 changes: 67 additions & 0 deletions Dockerfile
@@ -0,0 +1,67 @@
# Listenr — AMD ROCm fine-tuning image
#
# Base: official AMD-tested ROCm 7.2 + PyTorch 2.9.1 image (Python 3.12).
# Ref: https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/pytorch_install.html
#
# Pull: podman pull rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
# Build: podman build -t listenr-rocm .
# Run: see docs/finetune-amd.md
#
# NOTE: sounddevice (microphone capture) will not work inside this container.
# This image is intended for fine-tuning only (listenr-finetune /
# listenr-build-dataset), not real-time audio capture.
#
# NOTE: listenr requires Python >=3.13 for local installs; this image uses
# Python 3.12 (the AMD-tested version). --ignore-requires-python is safe
# here — the codebase uses no 3.13-only syntax.

FROM rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1

# ── system packages ──────────────────────────────────────────────────────────
# libsndfile1 : required by soundfile (audio I/O in finetune data pipeline)
# ffmpeg : optional but useful for converting audio files
#
# IMPORTANT: We must not upgrade libdrm, mesa, or any ROCm library — doing so
# breaks the GPU stack that ships with the base image. Use --no-upgrade to
# install only what is missing (both packages are typically absent from the
# base image but their deps like libdrm are already present).
RUN apt-get update && apt-get install -y --no-install-recommends --no-upgrade \
        libsndfile1 \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# ── project install ──────────────────────────────────────────────────────────
WORKDIR /app
COPY . /app

# Freeze the ROCm-aware torch/torchvision/torchaudio/triton that ship in the
# base image before installing finetune extras. Without this, pip resolves
# transformers' torch dependency and pulls a CPU-only build from PyPI.
# pip show works regardless of whether torch was installed via URL or wheel.
RUN pip show torch torchvision torchaudio triton 2>/dev/null \
    | awk '/^Name:/{name=$2} /^Version:/{print name "==" $2}' \
    > /tmp/torch-constraints.txt \
    && cat /tmp/torch-constraints.txt

# Install core + finetune extras, pinning torch to the ROCm version above.
# --ignore-requires-python: base image is Python 3.12; constraint is >=3.13.
RUN pip install --no-cache-dir \
    --ignore-requires-python \
    --constraint /tmp/torch-constraints.txt \
    -e ".[finetune]"

# ── runtime defaults ─────────────────────────────────────────────────────────
# Pin to GPU 0 by default to avoid imbalance crashes on multi-GPU systems.
# Override at runtime: -e HIP_VISIBLE_DEVICES=0,1
#
# Do NOT set HSA_OVERRIDE_GFX_VERSION here — an empty string is not the same
# as unset and causes ROCm to fail. Set it at runtime only if your GPU needs
# it (e.g. -e HSA_OVERRIDE_GFX_VERSION=10.3.0 for RX 6000 series).
#
# TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: enables Flash Efficient and
# Memory Efficient attention on newer AMD GPUs (RDNA 3 / RDNA 4). Without
# this, PyTorch logs a UserWarning and falls back to a slower implementation.
ENV HIP_VISIBLE_DEVICES="0" \
    TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL="1"

CMD ["listenr-finetune", "--help"]
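The constraint-freezing step in the Dockerfile above relies on an awk one-liner that pairs each `Name:` line of `pip show` output with the `Version:` line that follows it. The Python sketch below is a rough equivalent for readers who want to check the logic outside a container build. The function name is hypothetical, and the version strings in the demo are made up for illustration; they are not the versions shipped in the base image.

```python
def constraints_from_pip_show(text: str) -> list[str]:
    """Turn `pip show` output into `name==version` constraint lines.

    Follows the awk logic: remember the latest Name, emit a line on Version.
    Unlike the awk version, a Version line with no preceding Name is skipped.
    """
    constraints = []
    name = None
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Version:") and name is not None:
            version = line.split(":", 1)[1].strip()
            constraints.append(f"{name}=={version}")
    return constraints


if __name__ == "__main__":
    # Illustrative input only; real output has more fields per package.
    sample = "Name: torch\nVersion: 1.2.3\n---\nName: triton\nVersion: 4.5.6\n"
    for line in constraints_from_pip_show(sample):
        print(line)
```

Packages that `pip show` cannot find produce no `Name:`/`Version:` pair, so they simply contribute no constraint line.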
191 changes: 18 additions & 173 deletions README.md
@@ -11,193 +11,39 @@ Listenr is a privacy-first tool for collecting real-world audio and high-quality
- **Open models.** Uses Whisper.cpp for transcription and any GGUF-compatible LLM for post-processing correction.
- **Automatic correction pipeline.** A local LLM cleans up punctuation, grammar, and homophones — producing a higher-quality training corpus than raw Whisper output alone.
- **Real-world data.** Collects natural, conversational speech in realistic environments.
- **Dataset-ready output.** Every utterance is saved with its audio clip, a per-clip JSON, and appended to a single `manifest.jsonl`. One command builds train/dev/test splits.
- **Dataset-ready output.** Every utterance is saved with its audio clip and appended to a single `manifest.jsonl`. One command builds train/dev/test splits.

## How It Works

1. **Capture.** `listenr` streams your microphone to Lemonade's `/realtime` WebSocket in ~85 ms chunks. Audio is captured at the device's native rate and resampled to 16 kHz before sending.
2. **VAD.** Lemonade's built-in server-side voice activity detection segments speech boundaries automatically.
3. **Transcribe.** Lemonade runs Whisper.cpp on each speech segment and streams back interim and final transcripts.
4. **Correct (optional).** The final transcript is sent to a local LLM via Lemonade's chat completions API. The LLM returns a cleaned transcript, an `is_improved` flag, and content `categories`.
5. **Save.** Each utterance is saved as a `.wav` clip and appended to `manifest.jsonl`.
6. **Build dataset.** `build_dataset.py` reads the manifest and writes train/dev/test CSV splits.
1. **Capture:** `listenr` streams your microphone to Lemonade's `/realtime` WebSocket in ~85 ms chunks, resampled to 16 kHz.
2. **VAD:** Lemonade's built-in voice activity detection segments speech boundaries automatically.
3. **Transcribe:** Lemonade runs Whisper.cpp on each segment and streams back transcripts.
4. **Correct (optional):** A local LLM cleans the transcript and tags content categories.
5. **Save:** Each utterance is saved as a `.wav` clip and a line in `manifest.jsonl`.
6. **Build dataset:** `listenr-build-dataset` writes train/dev/test splits from the manifest.
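
The capture step's numbers can be made concrete with a little arithmetic: an 85 ms chunk at 16 kHz is 1360 samples. The sketch below shows that calculation together with a naive linear-interpolation resampler. It is an illustration under stated assumptions, not listenr's actual resampling code, and the function names are hypothetical.

```python
TARGET_RATE_HZ = 16_000  # target rate stated in the README
CHUNK_MS = 85            # approximate chunk duration stated in the README


def samples_per_chunk(rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Number of samples in one chunk at the given sample rate."""
    return rate_hz * chunk_ms // 1000


def resample_linear(samples, src_rate: int, dst_rate: int = TARGET_RATE_HZ):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    print(samples_per_chunk(TARGET_RATE_HZ))  # samples per chunk at 16 kHz
    print(samples_per_chunk(48_000))          # same chunk at a 48 kHz device
```

A production resampler would low-pass filter before decimating; the linear version is kept here only because it is short enough to read in one pass.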



## Requirements

- [Lemonade Server](https://lemonade-server.ai) running on `localhost:8000`
- Python 3.13+ with `uv` (recommended) or `pip`
- A microphone accessible via PipeWire or ALSA

## Installation
## Quick Start

```bash
git clone https://github.com/Rebreda/listenr
cd listenr
uv pip install -e .
```

Then run commands via `uv run` (no activation needed):

```bash
lemonade-server serve # in another terminal
uv run listenr
```

Or activate the venv once per session:
See [docs/setup.md](docs/setup.md) for full installation instructions.

```bash
source .venv/bin/activate
listenr
```
## Documentation

## Start Lemonade Server

```bash
lemonade-server serve
```

Listenr will automatically call `POST /api/v1/load` on startup to load the configured models. On first use, Lemonade will download them.

## Usage

### CLI — Real-Time Microphone Capture

```bash
# Record and save everything (default)
uv run listenr

# Don't save to disk — just print transcriptions
uv run listenr --no-save

# Also print the raw Whisper output before LLM correction
uv run listenr --show-raw

# Verbose debug output (WebSocket messages, mic RMS, etc.)
uv run listenr --debug
```

Example output:

```
🎤 Listenr CLI — streaming to Lemonade
Model : Whisper-Large-v3-Turbo
WS URL : ws://localhost:9000/realtime?model=Whisper-Large-v3-Turbo
LLM : enabled (gpt-oss-20b-mxfp4-GGUF)
Save : yes → ~/.listenr/audio_clips
Press Ctrl+C to stop.

[ASR] I'm going to the store to buy some milk. [dictation]
[SAVED] ~/.listenr/audio_clips/audio/2026-02-28/clip_2026-02-28_abc123.wav (2.4s)
```

Press **Ctrl+C** to stop. Listenr will unload all models from Lemonade before exiting.

### Build a Dataset

After collecting recordings, generate train/dev/test splits from `manifest.jsonl`:

```bash
# Default: 80/10/10 CSV splits in ~/listenr_dataset/
uv run listenr-build-dataset

# Custom output directory and split ratio
uv run listenr-build-dataset --output ~/my_dataset --split 90/5/5

# Exclude very short clips
uv run listenr-build-dataset --min-duration 1.0

# HuggingFace datasets format
uv run listenr-build-dataset --format hf

# Preview stats without writing files
uv run listenr-build-dataset --dry-run
```

Output CSV columns: `uuid`, `split`, `audio_path`, `raw_transcription`, `corrected_transcription`, `is_improved`, `categories`, `duration_s`, `sample_rate`, `whisper_model`, `llm_model`, `timestamp`.
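
One way to think about the `--split 90/5/5` option is as three percentages that are normalized and then used to assign each clip to a split. The sketch below parses such a string and assigns clips deterministically by hashing the `uuid` column. This is an assumption-laden illustration: the source does not say how `listenr-build-dataset` actually assigns clips, and both function names are hypothetical.

```python
import hashlib


def parse_split(spec: str) -> tuple:
    """Parse a 'train/dev/test' string such as '80/10/10' into fractions."""
    parts = [float(p) for p in spec.split("/")]
    if len(parts) != 3 or any(p < 0 for p in parts) or sum(parts) <= 0:
        raise ValueError(f"expected three non-negative numbers, got {spec!r}")
    total = sum(parts)
    return tuple(p / total for p in parts)


def assign_split(uuid: str, ratios: tuple) -> str:
    """Map a clip uuid to 'train', 'dev' or 'test', stable across runs."""
    digest = hashlib.sha256(uuid.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result lies in [0, 1).
    u = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    train, dev, _ = ratios
    if u < train:
        return "train"
    if u < train + dev:
        return "dev"
    return "test"


if __name__ == "__main__":
    ratios = parse_split("90/5/5")
    print(assign_split("example-uuid", ratios))
```

Hash-based assignment keeps a clip in the same split when the manifest grows, at the cost of only approximating the requested ratios on small datasets.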

### Batch Transcription

Transcribe a single audio file:

```bash
python -m listenr.unified_asr --audio path/to/audio.wav --whisper-model Whisper-Large-v3-Turbo

# With LLM correction
python -m listenr.unified_asr --llm --audio path/to/audio.wav
```

## Configuration

Config is created with defaults at `~/.config/listenr/config.ini` on first run.


### Finding your input device

```bash
python -c "import sounddevice as sd; [print(f'{i}: {d[\"name\"]}') for i, d in enumerate(sd.query_devices()) if d['max_input_channels'] > 0]"
```

Set `input_device` to the device name (partial match works) or its index number.

### VAD Tuning

| Goal | Setting |
| Guide | Description |
|---|---|
| Shorter segments | Lower `silence_duration_ms` (e.g. `500`) |
| Avoid cutting off speech | Raise `silence_duration_ms` (e.g. `1200`) |
| Ignore background noise | Raise `threshold` (e.g. `0.05`) |
| Capture quiet speech | Lower `threshold` (e.g. `0.005`) |
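
To see why these two settings pull in the directions the table describes, consider a toy energy gate: a segment opens when frame loudness reaches `threshold` and closes after `silence_duration_ms` of quiet. The sketch below is only a model for building intuition. Lemonade's server-side VAD is not described in this document and may work differently, and `segment_speech` is a hypothetical function.

```python
def segment_speech(rms_per_frame, frame_ms, threshold, silence_duration_ms):
    """Return (start_frame, end_frame_exclusive) spans of detected speech."""
    silence_frames_needed = max(1, silence_duration_ms // frame_ms)
    segments = []
    start = None   # first frame of the currently open segment, if any
    quiet_run = 0  # consecutive below-threshold frames inside the segment
    for i, rms in enumerate(rms_per_frame):
        if rms >= threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= silence_frames_needed:
                # Close the segment at the last loud frame.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Input ended while a segment was open; trim trailing quiet frames.
        segments.append((start, len(rms_per_frame) - quiet_run))
    return segments


if __name__ == "__main__":
    frames = [0.0, 0.1, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0]
    print(segment_speech(frames, 100, 0.05, 300))  # one longer segment
    print(segment_speech(frames, 100, 0.05, 100))  # two shorter segments
```

With the same input, the shorter silence window splits the speech at the brief pause, while the longer window bridges it.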


### manifest.jsonl

One JSON object per line — append-only, easy to query:

```bash
# All improved clips
jq 'select(.is_improved == true)' ~/.listenr/audio_clips/manifest.jsonl

# Clips tagged as commands
jq 'select(.categories[] == "command")' ~/.listenr/audio_clips/manifest.jsonl

# Load into pandas
python -c "import pandas as pd; df = pd.read_json('~/.listenr/audio_clips/manifest.jsonl', lines=True); print(df.head())"
```
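
The same filters can be written with only the Python standard library, which may be convenient where `jq` or pandas is not installed. The sketch below assumes the `is_improved` and `categories` fields shown in the commands above; the helper names are hypothetical and not part of listenr.

```python
import json
from pathlib import Path


def iter_manifest(path):
    """Yield one dict per non-empty line of a JSON Lines manifest."""
    with Path(path).expanduser().open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def improved(entries):
    """Entries whose is_improved flag is exactly true."""
    return [e for e in entries if e.get("is_improved") is True]


def with_category(entries, category):
    """Entries tagged with the given category (missing/null tags are skipped)."""
    return [e for e in entries if category in (e.get("categories") or [])]


if __name__ == "__main__":
    manifest = Path("~/.listenr/audio_clips/manifest.jsonl").expanduser()
    if manifest.exists():
        rows = list(iter_manifest(manifest))
        print(len(improved(rows)), "improved clips")
        print(len(with_category(rows, "command")), "clips tagged as commands")
```

Because the manifest is append-only, streaming it line by line avoids loading the whole file when only a count or a filtered subset is needed.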

## Troubleshooting

**No transcriptions appear / `[SAVE SKIPPED] pcm_buffer is empty`**
- Check that Lemonade is running: `curl http://localhost:8000/api/v1/health`
- Run with `--debug` to see mic RMS values and WebSocket messages
- If RMS stays near `0.000`, your `input_device` is wrong — list devices and update config (see above)
- Lower `threshold` in `[VAD]` if your mic is quiet

**LLM correction not working / model answers the transcription instead of fixing it**
- Confirm `LLM.enabled = true` and the model name matches one loaded in Lemonade
- Check `curl http://localhost:8000/api/v1/models` to see loaded models
- LLM errors are non-fatal — the raw transcript is saved regardless

**`Could not discover Lemonade websocket port`**
Lemonade is not running or not reachable on port 8000. Run `lemonade-server serve` first.

**Too many / too few segments**
Adjust `[VAD] silence_duration_ms` and `threshold` in your config.

## Available Models (via Lemonade)

| Model | Type | Notes |
|---|---|---|
| `Whisper-Base` | ASR | Fast, lower accuracy |
| `Whisper-Large-v3-Turbo` | ASR | Best accuracy |
| `gpt-oss-20b-mxfp4-GGUF` | LLM | Good correction quality |
| `Gemma-3-4b-it-GGUF` | LLM | Lighter alternative |
| `DeepSeek-Qwen3-8B-GGUF` | LLM | Lighter alternative |

List all models available on your Lemonade instance:
```bash
curl -s http://localhost:8000/api/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data']]"
```
| [docs/setup.md](docs/setup.md) | Installation, Lemonade Server, microphone setup |
| [docs/configuration.md](docs/configuration.md) | Full `config.ini` reference, VAD tuning, available models |
| [docs/recording.md](docs/recording.md) | CLI usage, how recording works, batch transcription |
| [docs/dataset.md](docs/dataset.md) | Building train/dev/test splits, CSV and HF formats |
| [docs/finetune-amd.md](docs/finetune-amd.md) | Fine-tuning Whisper on AMD GPU via ROCm + Podman |
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common errors and fixes |

## License

@@ -208,4 +54,3 @@ Mozilla Public License Version 2.0 — see `LICENSE`.
- [Lemonade Server](https://lemonade-server.ai) — unified local inference API
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — fast local ASR
- [llama.cpp](https://github.com/ggerganov/llama.cpp) — fast local LLMs
