A near real-time speech-to-text application powered by Mistral AI's Voxtral models. This project demonstrates how to implement phrase-by-phrase transcription using Next.js, client-side VAD (Voice Activity Detection), and Mistral's transcription API.
sequenceDiagram
participant U as User
participant C as Client (useScribe)
participant S as Next.js Server (/api/transcribe)
participant M as Mistral API
U->>C: Speaks ("Hello world")
C->>C: Buffers Audio & Checks Volume (VAD)
Note over C: User pauses for 600ms
C->>C: VAD Trigger -> Encode WAV
C->>S: POST /api/transcribe (FormData)
S->>M: POST /v1/audio/transcriptions (Proxy)
M-->>S: JSON Response ("Hello world")
S-->>C: JSON Response
C->>U: UI Update (Append Text)
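The server leg of the diagram (the proxy from /api/transcribe to Mistral) could be sketched roughly as follows. This is a minimal sketch, not the project's actual route: the helper name `forwardToMistral`, the form field name `file`, the model name `voxtral-mini-latest`, the upload filename, and the Bearer-token header are all assumptions made for illustration. The `fetchImpl` parameter exists only so the helper can be exercised without a network call.

```typescript
// Assumed upstream URL and model name; check Mistral's documentation before relying on them.
const MISTRAL_URL = "https://api.mistral.ai/v1/audio/transcriptions";
const MODEL = "voxtral-mini-latest";

// Hypothetical helper: re-packages the audio and forwards it with the server-side key.
async function forwardToMistral(
  audio: Blob,
  apiKey: string,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const form = new FormData();
  form.append("file", audio, "phrase.wav"); // filename is illustrative
  form.append("model", MODEL);

  const upstream = await fetchImpl(MISTRAL_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` }, // assumed auth scheme
    body: form,
  });

  // Pass the upstream body and status through unchanged so the client sees real errors.
  const text = await upstream.text();
  return new Response(text, {
    status: upstream.status,
    headers: {
      "Content-Type": upstream.headers.get("Content-Type") ?? "application/json",
    },
  });
}

// Shape of a Next.js App Router handler, e.g. in app/api/transcribe/route.ts.
export async function POST(request: Request): Promise<Response> {
  const apiKey = process.env.MISTRAL_API_KEY;
  if (!apiKey) {
    return new Response(JSON.stringify({ error: "MISTRAL_API_KEY is not set" }), {
      status: 500,
      headers: { "Content-Type": "application/json" },
    });
  }
  const incoming = await request.formData();
  const file = incoming.get("file"); // assumed field name used by the client
  if (!(file instanceof Blob)) {
    return new Response(JSON.stringify({ error: "Missing audio file" }), {
      status: 400,
      headers: { "Content-Type": "application/json" },
    });
  }
  return forwardToMistral(file, apiKey);
}
```

Because the key is read from `process.env` inside the handler, it never appears in the browser bundle; the client only ever talks to /api/transcribe.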
- Live Transcription: Converts speech to text phrase by phrase, with text appearing shortly after each pause.
- Smart VAD (Voice Activity Detection): Automatically detects when you stop speaking to send audio phrases, ensuring better context and accuracy.
- Secure Architecture: API keys are kept safe on the server side using a proxy route.
- Modern UI: Built with Next.js, Tailwind CSS, and Framer Motion for a smooth user experience.
- Node.js 18+
- pnpm (recommended) or npm/yarn
- A Mistral API Key
- Clone the repository (if you haven't already).
- Install dependencies:
    pnpm install
- Configure Environment: Create a .env.local file in the root directory and add your Mistral API key:
    MISTRAL_API_KEY=your_actual_api_key_here
- Run the Development Server:
    pnpm dev
- Open the App: Navigate to http://localhost:3000 in your browser.
- Audio Capture: The app captures microphone input using the Web Audio API.
- VAD Processing: It analyzes the audio volume in real time.
  - If you pause for 600ms, it treats the pause as the end of a phrase and sends the buffered audio.
  - If you talk for 5 seconds continuously, it sends the buffered audio anyway to keep latency low.
- WAV Encoding: The raw audio buffer is converted to a WAV file on the client.
- Upload: The WAV file is sent to the Next.js API route (/api/transcribe).
- Transcription: The server proxies the file to Mistral's API and returns the text.
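The VAD step above can be illustrated with a small volume-based segmenter. This is a sketch under stated assumptions, not the project's actual hook: the class name `PhraseSegmenter`, the RMS volume measure, and the `silenceThreshold` value are hypothetical, while the 600 ms pause and 5 second cap come from the description above.

```typescript
// Hypothetical configuration; only pauseMs and maxPhraseMs come from the README.
interface VadConfig {
  sampleRate: number;       // samples per second of the incoming audio
  silenceThreshold: number; // RMS below this counts as silence (assumed value)
  pauseMs: number;          // pause that ends a phrase, e.g. 600
  maxPhraseMs: number;      // forced cut-off, e.g. 5000
}

// Root-mean-square volume of one frame.
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class PhraseSegmenter {
  private frames: Float32Array[] = [];
  private bufferedSamples = 0;
  private silentSamples = 0;
  private heardSpeech = false;
  private readonly threshold: number;
  private readonly pauseSamples: number;
  private readonly maxSamples: number;

  constructor(cfg: VadConfig) {
    this.threshold = cfg.silenceThreshold;
    // Work in sample counts to avoid repeated millisecond conversions.
    this.pauseSamples = Math.round((cfg.pauseMs / 1000) * cfg.sampleRate);
    this.maxSamples = Math.round((cfg.maxPhraseMs / 1000) * cfg.sampleRate);
  }

  // Feed one frame; returns a complete phrase when one ends, otherwise null.
  push(frame: Float32Array): Float32Array | null {
    const speaking = rms(frame) >= this.threshold;
    if (!speaking && !this.heardSpeech) return null; // skip leading silence

    this.frames.push(frame.slice()); // copy, since callers may reuse the buffer
    this.bufferedSamples += frame.length;
    if (speaking) {
      this.heardSpeech = true;
      this.silentSamples = 0;
    } else {
      this.silentSamples += frame.length;
    }

    const paused = this.silentSamples >= this.pauseSamples;
    const tooLong = this.bufferedSamples >= this.maxSamples;
    return paused || tooLong ? this.flush() : null;
  }

  // Concatenate buffered frames into one phrase and reset all state.
  private flush(): Float32Array {
    const phrase = new Float32Array(this.bufferedSamples);
    let offset = 0;
    for (const f of this.frames) {
      phrase.set(f, offset);
      offset += f.length;
    }
    this.frames = [];
    this.bufferedSamples = 0;
    this.silentSamples = 0;
    this.heardSpeech = false;
    return phrase;
  }
}
```

In the browser, frames would typically come from the ScriptProcessorNode `onaudioprocess` event via `event.inputBuffer.getChannelData(0)`, and each phrase returned by `push` would then be WAV-encoded and uploaded.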
- Framework: Next.js 15 (App Router)
- Styling: Tailwind CSS
- Animations: Framer Motion
- Audio: Web Audio API (ScriptProcessorNode)
- AI: Mistral Voxtral
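
ScriptProcessorNode delivers audio as Float32Array samples in the range -1 to 1, so the client-side WAV encoding step amounts to writing a 44-byte RIFF header followed by 16-bit PCM samples. The sketch below assumes mono, 16-bit output; the function name `encodeWav` is hypothetical and the project's own encoder may differ.

```typescript
// Minimal mono 16-bit PCM WAV encoder (illustrative; assumes a single channel).
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF container header
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");

  // "fmt " chunk describing the audio format
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // chunk size for PCM
  view.setUint16(20, 1, true);                           // format 1 = PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample

  // "data" chunk with the samples themselves
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to the valid range
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;  // map to the signed 16-bit range
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  }
  return buffer;
}
```

The result can be wrapped for upload with `new Blob([encodeWav(samples, sampleRate)], { type: "audio/wav" })` and appended to a FormData object.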