---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).
Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
Evoxtral, the core model, is a LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Training is a two-stage pipeline: SFT (3 epochs) followed by RL via RAFT (rejection sampling, 1 epoch).
Standard ASR: So I was thinking maybe we could try that new restaurant downtown.
Evoxtral: [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
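The README does not spell out the RAFT details (reward, candidate count, filtering rule), so the following is only a minimal sketch of the rejection-sampling stage, assuming a tag-overlap reward against a reference transcript; `model.generate` and `model.finetune` are illustrative names, not the repo's API.

```python
# Hypothetical sketch of the RAFT (rejection sampling) stage; the model interface
# and the reward definition are assumptions, not the repo's actual code.
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def tag_reward(candidate: str, reference: str) -> float:
    """Assumed reward: fraction of reference audio tags reproduced in the candidate."""
    pred = set(t.lower() for t in TAG_RE.findall(candidate))
    ref = set(t.lower() for t in TAG_RE.findall(reference))
    return len(pred & ref) / len(ref) if ref else 0.0

def raft_round(model, dataset, k: int = 8):
    """One RAFT round: sample k candidates per clip, keep the best, fine-tune on the survivors."""
    kept = []
    for audio, reference in dataset:
        candidates = [model.generate(audio, temperature=0.9) for _ in range(k)]
        best = max(candidates, key=lambda c: tag_reward(c, reference))
        kept.append((audio, best))
    model.finetune(kept)  # a further SFT pass on the rejection-sampled transcripts
    return model
```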
Two model variants:
- Evoxtral SFT — Best transcription accuracy (lowest WER)
- Evoxtral RL — Best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER ↓ | 6.64% | 4.47% | 5.12% | SFT |
| CER ↓ | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 ↑ | 22.0% | 67.2% | 69.4% | RL |
| Tag Recall ↑ | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 ↑ | 42.0% | 84.0% | 86.0% | RL |
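For context, WER/CER are edit-distance scores over the tag-stripped text, while the tag metrics compare the sets of inline tags. A minimal sketch of such scoring is below; using jiwer is an assumption, not necessarily what the repo's eval scripts do.

```python
# Illustrative scoring: WER/CER on tag-stripped text, set overlap for the tag metrics.
import re
import jiwer

TAG_RE = re.compile(r"\[([^\]]+)\]")

def strip_tags(text: str) -> str:
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()

def score(reference: str, hypothesis: str) -> dict:
    ref_tags = set(t.lower() for t in TAG_RE.findall(reference))
    hyp_tags = set(t.lower() for t in TAG_RE.findall(hypothesis))
    recall = len(ref_tags & hyp_tags) / len(ref_tags) if ref_tags else 0.0
    precision = len(ref_tags & hyp_tags) / len(hyp_tags) if hyp_tags else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "wer": jiwer.wer(strip_tags(reference), strip_tags(hypothesis)),
        "cer": jiwer.cer(strip_tags(reference), strip_tags(hypothesis)),
        "tag_recall": recall,
        "tag_f1": f1,
    }

print(score(
    "[nervous] So... I was thinking maybe we could try that new restaurant downtown?",
    "[nervous] So I was thinking maybe we could [laughs nervously] try that new restaurant downtown",
))
```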
- Models: SFT Model | RL Model
- Live Demo (HF Space)
- API (Swagger UI)
- W&B Dashboard
- Technical Report (PDF) | LaTeX source
MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.
Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
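In the app the ONNX model runs in the browser via ONNX Runtime; purely as an illustration, here is a server-side sketch with onnxruntime in Python. The model path, input size, and normalization are assumptions; check models/ and the frontend code for the real preprocessing.

```python
# Hypothetical FER inference with onnxruntime; path and preprocessing are assumed.
import numpy as np
import onnxruntime as ort

CLASSES = ["Anger", "Contempt", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

session = ort.InferenceSession("models/fer_mobilevit_xxs.onnx")  # illustrative path
input_name = session.get_inputs()[0].name

def predict(face_rgb: np.ndarray) -> str:
    """face_rgb: HxWx3 uint8 face crop, already resized to the model's expected input size."""
    x = face_rgb.astype(np.float32) / 255.0      # assumed [0, 1] scaling
    x = np.transpose(x, (2, 0, 1))[None, ...]    # HWC -> NCHW
    logits = session.run(None, {input_name: x})[0][0]
    return CLASSES[int(np.argmax(logits))]
```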
Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
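Which VAD the service uses is not specified here; as one way to do VAD-based sentence segmentation, a sketch with Silero VAD (an assumption, not necessarily the service's actual VAD):

```python
# Illustrative VAD segmentation with Silero VAD; the actual service may use a different VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
# Each speech segment would then be transcribed by Voxtral and scored for emotion.
for seg in segments:
    print(f"{seg['start']:.2f}s -> {seg['end']:.2f}s")
```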
Request flow: Browser (Next.js UI, :3030) → Server layer (Node, :3000) → Model layer (Python, :8000).

| Browser (:3030) | Server layer (Node, :3000) | Model layer (Python, :8000) |
|---|---|---|
| Studio UI | POST /api/speech-to-text | POST /transcribe |
| Upload dialog | POST /api/transcribe-diarize | POST /transcribe-diarize |
| | GET /health | GET /health |
| Layer | Path | Role |
|---|---|---|
| Model | model/voxtral-server | Voxtral inference, VAD segmentation, emotion analysis |
| Server | demo/server | API entrypoint; proxies to the Model layer |
| Frontend | demo | Next.js UI (upload, Studio editor, waveform, timeline) |
| Evoxtral | training/scripts/ | Training, eval, RL, and serving for expressive transcription |
| FER | models/ | Facial emotion recognition ONNX model |
See demo/README.md for full API and usage; model/voxtral-server/README.md for the Model API.
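As a rough sketch of the Model-layer contract (the route names come from the table above; the request and response shapes are assumptions, and the real implementation lives in model/voxtral-server):

```python
# Hypothetical FastAPI sketch of the Model layer's /transcribe route; shapes are assumed.
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

def run_pipeline(audio_bytes: bytes) -> list[dict]:
    """Placeholder for VAD segmentation + Voxtral transcription + per-segment emotion analysis."""
    return [{"start": 0.0, "end": 1.2, "text": "[neutral] ...", "emotion": "Neutral"}]

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)) -> dict:
    audio_bytes = await file.read()
    return {"segments": run_pipeline(audio_bytes)}
```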
├── api/ # Python FastAPI — local Voxtral inference + FER
├── proxy/ # Node.js/Express — API gateway for frontend
├── web/ # Next.js — Studio editor UI
├── training/ # Fine-tuning code (SFT + RL), data prep, eval
│ └── scripts/ # Modal scripts: train, RL (RAFT), eval, serve
├── space/ # HuggingFace Space (Gradio demo)
├── models/ # FER ONNX model (MobileViT-XXS)
├── docs/ # Technical report, design docs, research refs
├── data/ # Training data scripts (audio files gitignored)
└── Dockerfile # Single-container HF Spaces build
Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.
Model layer:

cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Server layer:

cd demo/server && npm install && npm run dev

Frontend:

cd demo && npm install && npm run dev

Then open http://localhost:3030.
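Once the three services are running, the Server layer can be exercised directly. A quick smoke test; the multipart field name "file" and the response shape are assumptions (see demo/README.md for the actual request format):

```python
# Quick smoke test against the local Server layer; field names are assumed.
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/api/speech-to-text",
        files={"file": ("sample.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json())
```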
modal deploy training/scripts/serve_modal.py

- Models: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- Training: PyTorch, PEFT, Weights & Biases
- Inference: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- Backend: FastAPI, Node.js
- Frontend: Next.js, Gradio
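For orientation, a minimal sketch of what a Modal serving entrypoint in the spirit of training/scripts/serve_modal.py might contain; the GPU type, image contents, and request/response shapes are all assumptions rather than the repo's actual code.

```python
# Hypothetical Modal serving sketch; deploy with `modal deploy <this file>`.
import modal

image = modal.Image.debian_slim().pip_install("torch", "transformers", "peft")
app = modal.App("evoxtral-serve", image=image)

@app.function(gpu="A10G")
@modal.web_endpoint(method="POST")
def transcribe(item: dict) -> dict:
    # The real entrypoint would load Voxtral-Mini plus the Evoxtral LoRA adapter
    # and run inference on the audio referenced in `item`.
    return {"text": "[neutral] placeholder transcription"}
```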