---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).
Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
Evoxtral, the core model, is a LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Training is a two-stage pipeline: SFT (3 epochs) followed by RL via RAFT (rejection sampling, 1 epoch).
Standard ASR: So I was thinking maybe we could try that new restaurant downtown.
Evoxtral: [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
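The README does not spell out the RAFT details (reward, candidate count, filtering rule), so the following is only a minimal sketch of the rejection-sampling stage, assuming a tag-overlap reward against a reference transcript; `model.generate` and `model.finetune` are illustrative names, not the repo's API.

```python
# Hypothetical sketch of the RAFT (rejection sampling) stage; the model interface
# and the reward definition are assumptions, not the repo's actual code.
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def tag_reward(candidate: str, reference: str) -> float:
    """Assumed reward: fraction of reference audio tags reproduced in the candidate."""
    pred = set(t.lower() for t in TAG_RE.findall(candidate))
    ref = set(t.lower() for t in TAG_RE.findall(reference))
    return len(pred & ref) / len(ref) if ref else 0.0

def raft_round(model, dataset, k: int = 8):
    """One RAFT round: sample k candidates per clip, keep the best, fine-tune on the survivors."""
    kept = []
    for audio, reference in dataset:
        candidates = [model.generate(audio, temperature=0.9) for _ in range(k)]
        best = max(candidates, key=lambda c: tag_reward(c, reference))
        kept.append((audio, best))
    model.finetune(kept)  # a further SFT pass on the rejection-sampled transcripts
    return model
```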
Two model variants:
- Evoxtral SFT — Best transcription accuracy (lowest WER)
- Evoxtral RL — Best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER ↓ | 6.64% | 4.47% | 5.12% | SFT |
| CER ↓ | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 ↑ | 22.0% | 67.2% | 69.4% | RL |
| Tag Recall ↑ | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 ↑ | 42.0% | 84.0% | 86.0% | RL |
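For context, WER/CER are edit-distance scores over the tag-stripped text, while the tag metrics compare the sets of inline tags. A minimal sketch of such scoring is below; using jiwer is an assumption, not necessarily what the repo's eval scripts do.

```python
# Illustrative scoring: WER/CER on tag-stripped text, set overlap for the tag metrics.
import re
import jiwer

TAG_RE = re.compile(r"\[([^\]]+)\]")

def strip_tags(text: str) -> str:
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()

def score(reference: str, hypothesis: str) -> dict:
    ref_tags = set(t.lower() for t in TAG_RE.findall(reference))
    hyp_tags = set(t.lower() for t in TAG_RE.findall(hypothesis))
    recall = len(ref_tags & hyp_tags) / len(ref_tags) if ref_tags else 0.0
    precision = len(ref_tags & hyp_tags) / len(hyp_tags) if hyp_tags else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "wer": jiwer.wer(strip_tags(reference), strip_tags(hypothesis)),
        "cer": jiwer.cer(strip_tags(reference), strip_tags(hypothesis)),
        "tag_recall": recall,
        "tag_f1": f1,
    }

print(score(
    "[nervous] So... I was thinking maybe we could try that new restaurant downtown?",
    "[nervous] So I was thinking maybe we could [laughs nervously] try that new restaurant downtown",
))
```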
- Models: SFT Model | RL Model
- Live Demo (HF Space)
- API (Swagger UI)
- W&B Dashboard
- Technical Report (PDF) | LaTeX source
MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.
Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
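In the app the ONNX model runs in the browser via ONNX Runtime; purely as an illustration, here is a server-side sketch with onnxruntime in Python. The model path, input size, and normalization are assumptions; check models/ and the frontend code for the real preprocessing.

```python
# Hypothetical FER inference with onnxruntime; path and preprocessing are assumed.
import numpy as np
import onnxruntime as ort

CLASSES = ["Anger", "Contempt", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

session = ort.InferenceSession("models/fer_mobilevit_xxs.onnx")  # illustrative path
input_name = session.get_inputs()[0].name

def predict(face_rgb: np.ndarray) -> str:
    """face_rgb: HxWx3 uint8 face crop, already resized to the model's expected input size."""
    x = face_rgb.astype(np.float32) / 255.0      # assumed [0, 1] scaling
    x = np.transpose(x, (2, 0, 1))[None, ...]    # HWC -> NCHW
    logits = session.run(None, {input_name: x})[0][0]
    return CLASSES[int(np.argmax(logits))]
```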
Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
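Which VAD the service uses is not specified here; as one way to do VAD-based sentence segmentation, a sketch with Silero VAD (an assumption, not necessarily the service's actual VAD):

```python
# Illustrative VAD segmentation with Silero VAD; the actual service may use a different VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
# Each speech segment would then be transcribed by Voxtral and scored for emotion.
for seg in segments:
    print(f"{seg['start']:.2f}s -> {seg['end']:.2f}s")
```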
Request flow: Browser (Next.js UI, :3030) → Server layer (Node, :3000) → Model layer (Python, :8000).

| Browser (:3030) | Server layer (Node, :3000) | Model layer (Python, :8000) |
|---|---|---|
| Studio UI | POST /api/speech-to-text | POST /transcribe |
| Upload dialog | POST /api/transcribe-diarize | POST /transcribe-diarize |
| | GET /health | GET /health |
| Layer | Path | Role |
|---|---|---|
| Model | model/voxtral-server | Voxtral inference, VAD segmentation, emotion analysis |
| Server | demo/server | API entrypoint; proxies to the Model layer |
| Frontend | demo | Next.js UI (upload, Studio editor, waveform, timeline) |
| Evoxtral | training/scripts/ | Training, eval, RL, and serving for expressive transcription |
| FER | models/ | Facial emotion recognition ONNX model |
See demo/README.md for full API and usage; model/voxtral-server/README.md for the Model API.
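As a rough sketch of the Model-layer contract (the route names come from the table above; the request and response shapes are assumptions, and the real implementation lives in model/voxtral-server):

```python
# Hypothetical FastAPI sketch of the Model layer's /transcribe route; shapes are assumed.
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

def run_pipeline(audio_bytes: bytes) -> list[dict]:
    """Placeholder for VAD segmentation + Voxtral transcription + per-segment emotion analysis."""
    return [{"start": 0.0, "end": 1.2, "text": "[neutral] ...", "emotion": "Neutral"}]

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)) -> dict:
    audio_bytes = await file.read()
    return {"segments": run_pipeline(audio_bytes)}
```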
├── api/ # Python FastAPI — local Voxtral inference + FER
├── proxy/ # Node.js/Express — API gateway for frontend
├── web/ # Next.js — Studio editor UI
├── training/ # Fine-tuning code (SFT + RL), data prep, eval
│ └── scripts/ # Modal scripts: train, RL (RAFT), eval, serve
├── space/ # HuggingFace Space (Gradio demo)
├── models/ # FER ONNX model (MobileViT-XXS)
├── docs/ # Technical report, design docs, research refs
├── data/ # Training data scripts (audio files gitignored)
└── Dockerfile # Single-container HF Spaces build
Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.
Model layer:

cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Server layer:

cd demo/server && npm install && npm run dev

Frontend:

cd demo && npm install && npm run dev

Then open http://localhost:3030.
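Once the three services are running, the Server layer can be exercised directly. A quick smoke test; the multipart field name "file" and the response shape are assumptions (see demo/README.md for the actual request format):

```python
# Quick smoke test against the local Server layer; field names are assumed.
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/api/speech-to-text",
        files={"file": ("sample.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json())
```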
modal deploy training/scripts/serve_modal.py

- Models: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- Training: PyTorch, PEFT, Weights & Biases
- Inference: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- Backend: FastAPI, Node.js
- Frontend: Next.js, Gradio
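For orientation, a minimal sketch of what a Modal serving entrypoint in the spirit of training/scripts/serve_modal.py might contain; the GPU type, image contents, and request/response shapes are all assumptions rather than the repo's actual code.

```python
# Hypothetical Modal serving sketch; deploy with `modal deploy <this file>`.
import modal

image = modal.Image.debian_slim().pip_install("torch", "transformers", "peft")
app = modal.App("evoxtral-serve", image=image)

@app.function(gpu="A10G")
@modal.web_endpoint(method="POST")
def transcribe(item: dict) -> dict:
    # The real entrypoint would load Voxtral-Mini plus the Evoxtral LoRA adapter
    # and run inference on the audio referenced in `item`.
    return {"text": "[neutral] placeholder transcription"}
```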