---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

Ethos Studio — Emotional Speech Recognition

Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.

Key Components

Evoxtral — Expressive Tagged Transcription

A LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Two-stage pipeline: SFT (3 epochs) → RL via RAFT (rejection sampling, 1 epoch).
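The rejection-sampling step of RAFT can be sketched as follows: sample several candidate transcriptions per clip, score each with a reward, and keep only the highest-scoring candidate for the next fine-tuning round. This is an illustrative sketch, not the project's training code; `raft_select` and the toy tag-counting reward are hypothetical names.

```python
from typing import Callable, List

def raft_select(
    candidates: List[List[str]],        # per-clip candidate transcriptions
    reward_fn: Callable[[str], float],  # e.g. tag F1 against a reference
) -> List[str]:
    """Rejection sampling: keep the highest-reward candidate per clip."""
    return [max(cands, key=reward_fn) for cands in candidates]

# Toy reward for illustration: count bracketed expressive tags.
def tag_count_reward(text: str) -> float:
    return float(text.count("["))

best = raft_select(
    [["hello world", "[nervous] hello... [laughs] world"]],
    tag_count_reward,
)
# best[0] is the tagged candidate
```

The selected candidates then become the SFT targets for the next round, which is what lets the RL variant trade a little WER for better tag accuracy.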

Standard ASR: So I was thinking maybe we could try that new restaurant downtown.

Evoxtral: [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
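Because the tags are plain bracketed annotations, a tagged transcript can be split into tags and clean text with a small regex. A minimal sketch (the project may ship its own parser):

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def split_tags(transcript: str):
    """Return (tags, clean_text) for an Evoxtral-style tagged transcript."""
    tags = TAG_RE.findall(transcript)
    clean = TAG_RE.sub("", transcript)
    clean = re.sub(r"\s+", " ", clean).strip()
    return tags, clean

tags, text = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# tags -> ['nervous', 'stammers', 'clears throat', 'laughs nervously']
```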

Two model variants:

  • Evoxtral SFT — Best transcription accuracy (lowest WER)
  • Evoxtral RL — Best expressive tag accuracy (highest Tag F1)
| Metric      | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|-------------|--------------|--------------|-------------|------|
| WER         | 6.64%        | 4.47%        | 5.12%       | SFT  |
| CER         | 2.72%        | 1.23%        | 1.48%       | SFT  |
| Tag F1      | 22.0%        | 67.2%        | 69.4%       | RL   |
| Tag Recall  | 22.0%        | 69.4%        | 72.7%       | RL   |
| Emphasis F1 | 42.0%        | 84.0%        | 86.0%       | RL   |
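WER is the standard word-level edit distance normalized by reference length (CER is the same at character level). A minimal implementation for reference:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution in a four-word reference -> 0.25
```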

FER — Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
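After inference, the model's 8 logits map onto these classes in order. A pure-Python post-processing sketch (the actual browser path runs the ONNX model via ONNX Runtime; the function name here is illustrative):

```python
import math

FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]

def top_emotion(logits):
    """Softmax over the 8 FER logits; return (label, probability)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]

label, p = top_emotion([0.1, -1.2, -0.5, 0.0, 3.2, 1.1, -0.3, 0.4])
# label -> 'Happy'
```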

Voxtral Server — Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
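The per-segment flow can be sketched with a toy energy-based VAD: frames above an energy threshold are grouped into speech spans, and each span would then be transcribed and scored for emotion. This is an illustrative stand-in, not the server's actual VAD:

```python
def vad_segments(samples, frame_len=160, threshold=0.01):
    """Group consecutive high-energy frames into (start, end) sample spans."""
    segments, start = [], None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = i        # speech onset
        elif start is not None:
            segments.append((start, i))  # speech offset
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# silence, a burst of "speech", then silence -> one segment
audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
# vad_segments(audio) -> [(320, 800)]
```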

Architecture

```
Browser (port 3030)  →  Server layer (Node, :3000)  →  Model layer (Python, :8000)
      ↑ Studio UI            POST /api/speech-to-text          POST /transcribe
      ↑ Upload dialog        POST /api/transcribe-diarize      POST /transcribe-diarize
                             GET  /health                      GET  /health
```
| Layer    | Path                 | Role                                                      |
|----------|----------------------|-----------------------------------------------------------|
| Model    | model/voxtral-server | Voxtral inference, VAD segmentation, emotion analysis     |
| Server   | demo/server          | API entrypoint; proxies to Model                          |
| Frontend | demo                 | Next.js UI (upload, Studio editor, waveform, timeline)    |
| Evoxtral | training/scripts/    | Training, eval, RL, serving for expressive transcription  |
| FER      | models/              | Facial emotion recognition ONNX model                     |

See demo/README.md for the full API and usage, and model/voxtral-server/README.md for the Model API.

Project Structure

```
├── model/
│   └── voxtral-server/     # Python FastAPI — local Voxtral inference + FER
├── demo/                   # Next.js — Studio editor UI
│   └── server/             # Node.js/Express — API gateway for frontend
├── training/               # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/            # Modal scripts: train, RL (RAFT), eval, serve
├── space/                  # HuggingFace Space (Gradio demo)
├── models/                 # FER ONNX model (MobileViT-XXS)
├── docs/                   # Technical report, design docs, research refs
├── data/                   # Training data scripts (audio files gitignored)
└── Dockerfile              # Single-container HF Spaces build
```

How to Run

Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.

Model layer (port 8000)

```bash
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Server layer (port 3000)

```bash
cd demo/server && npm install && npm run dev
```

Frontend (port 3030)

```bash
cd demo && npm install && npm run dev
```

Open http://localhost:3030.

Evoxtral API (Modal)

```bash
modal deploy training/scripts/serve_modal.py
```

Tech Stack

  • Models: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
  • Training: PyTorch, PEFT, Weights & Biases
  • Inference: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
  • Backend: FastAPI, Node.js
  • Frontend: Next.js, Gradio
