300 million people worldwide are blind or have low vision. Most online video was built for sighted viewers.
When a teacher points at a whiteboard, a presenter switches slides, or a character reacts silently, that visual moment is invisible to anyone relying on audio. Existing captions only transcribe speech. They do nothing for the visual context that sighted viewers take for granted.
Audio Description fills that gap. A narrator quietly describes what is happening on screen during natural pauses in dialogue:
> "Sarah just walked in. She looks a bit tired, but she's smiling at you from the left corner."
Creating AD tracks manually is slow and expensive. VidBuddy automates the entire workflow, from raw MP4 to a fully described, accessible video, without a backend server.
Audio Description is a technique for describing what is happening during a video, to benefit audience members who are blind or have low vision. It generally takes the form of a second audio track, and is available on TV, streaming services, and at movie theaters.
The narration is timed to fit within silent parts of the video, so it does not overlap the dialogue and does not increase the length of the program, unlike pausing to provide a description, which would break the viewing experience.
Providing content with audio description tracks is a legal requirement in several countries (USA, EU, Canada, Australia), and demand for AD-compliant content is accelerating as regulations tighten.
VidBuddy leverages AI to assist the AD authoring process end to end:
- Scene analysis – Azure Content Understanding generates a visual description for each shot and transcribes all dialogue
- Silent gap detection – VidBuddy identifies windows where narration can be inserted without overlapping speech
- Description rewriting – Azure OpenAI (or Gemini/Kimi) rewrites raw descriptions to fit precisely within each silent window
- Human review – the AI-generated script is presented to an AD editor as a draft to review, correct, and approve before any audio is synthesised
- Voice synthesis + export – Azure Neural TTS renders the final narration, and WebAssembly FFmpeg mixes it into a downloadable described MP4
We believe that making the AD authoring process faster, and thus less expensive, will result in more inclusive content being created, benefiting the 300 million people worldwide who are blind or have low vision.
| Role | How they use VidBuddy |
|---|---|
| The viewer | A blind or low-vision person who receives the exported described video and can now follow it with full context |
| The creator | A teacher, journalist, media team, nonprofit, or content owner who runs the VidBuddy studio to generate, review, and export the described version |
| The evaluator | A hackathon judge or accessibility reviewer exploring the automated → human-review → export pipeline |
The viewer does not need to use the studio. The creator uses VidBuddy to produce a better version of the video for them.
```
Source MP4
     │
     ▼
[1] Azure Content Understanding
    Scene detection · Dialogue transcript · Silent gap identification
     │
     ▼
[2] Azure OpenAI (GPT-4o-mini) – recommended
    Write a natural narrator sentence for each silent gap
    Tighten each sentence to fit within the available time window
     │
     ▼
[3] VidBuddy Studio – Human Review
    Creator sees every timed segment
    Can edit, delete, reorder, or add descriptions before synthesis
     │
     ▼
[4] Azure Neural TTS
    Synthesize each reviewed description into a natural voice clip
     │
     ▼
[5] WebAssembly FFmpeg (in the browser)
    Mix voice clips into the original video timeline
    Export a described MP4 – no backend server required
```
The landing page includes a built-in walkthrough using two pre-rendered sample clips bundled with the repo:
- `demo_input.mp4` – the raw source clip
- `demo.mp4` – the same clip with audio descriptions added
Open the app, press Watch It Work, and the page animates through the pipeline before revealing the described output. No Azure keys or account needed.
To run the full live pipeline with your own videos:
- Clone the repo and run `npm install`
- Copy `.env.example` to `.env` and fill in your Azure credentials
- Run `npm run dev` – the studio becomes fully functional
💬 Want a live demo but don't have Azure credentials? Reach out on GitHub: @thesumedh. Live execution costs real Azure credits, so the demo credentials are not published here. Drop a message and we can arrange a live walkthrough.
```shell
npm install
npm run dev
```

Open http://localhost:5173. The landing page and built-in demo work immediately – no credentials needed.
Copy and rename the example file:

```shell
cp .env.example .env
```

Then open `.env` and fill in your values.
```
VITE_STORAGE_ACCOUNT=your-storage-account-name
VITE_BLOB_SAS_TOKEN=sp=...&st=...&se=...&sv=...&sr=c&sig=...
VITE_AI_SERVICES_RESOURCE=your-ai-services-resource-name
VITE_AI_SERVICES_KEY=your-ai-services-key
VITE_AI_SERVICES_REGION=westus
```

Azure OpenAI – recommended (keeps everything in one Azure account):
```
VITE_LLM_PROVIDER=azure
VITE_AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
VITE_AZURE_OPENAI_API_KEY=your-openai-key
VITE_AZURE_OPENAI_MODEL=gpt-4o-mini
```

Budget alternatives: if you don't have an Azure OpenAI deployment, you can swap in a different rewrite model. VidBuddy supports:
| Provider | Set `VITE_LLM_PROVIDER` to | Extra key needed | Notes |
|---|---|---|---|
| Azure OpenAI | `azure` | `VITE_AZURE_OPENAI_*` | Recommended – all-Azure, enterprise-grade |
| Google Gemini | `gemini` | `VITE_GEMINI_API_KEY` | Free tier available – good hackathon option |
| Kimi K2.5 | `kimi` | `VITE_KIMI_API_KEY` | Strong multimodal alternative from Moonshot AI |

The rest of the pipeline (storage, scene analysis, speech) always runs on Azure regardless of which rewrite model you pick.
| Resource | What it does | Required regions |
|---|---|---|
| Azure AI Services | Scene analysis (Content Understanding) + Text-to-Speech | westus · swedencentral · australiaeast |
| Azure Storage Account | Store uploaded videos and synthesized audio clips | Any region |
| Azure OpenAI deployment | Rewrite raw descriptions into natural narration | Any region with GPT-4o-mini available |
In Azure Portal β your Storage Account β Resource Sharing (CORS):
| Setting | Value |
|---|---|
| Allowed origins | http://localhost:5173 (add your deployed URL too) |
| Allowed methods | GET, PUT, DELETE, OPTIONS |
| Allowed headers | * |
| Max age | 86400 |
Container name must be exactly: `audio-description`
Full step-by-step setup with screenshots: docs/SETUP_GUIDE.md
Azure Content Understanding processes the video and returns one entry per video shot, including:
- All transcript phrases detected in the shot
- A GPT-generated visual description of what is happening in the frame
VidBuddy groups consecutive shots with no detected speech into silent intervals β windows where audio description can be inserted without talking over dialogue.
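That grouping step can be sketched roughly like this. The `Shot` shape and the helper are simplified assumptions for illustration, not VidBuddy's actual types:

```typescript
interface Shot {
  start: number;      // seconds
  end: number;
  transcript: string; // "" when no speech was detected in the shot
}

interface SilentInterval {
  start: number;
  end: number;
}

// Merge consecutive speech-free shots into one narration window,
// then drop windows too short to fit any narration.
function findSilentIntervals(shots: Shot[], minDuration = 1.0): SilentInterval[] {
  const intervals: SilentInterval[] = [];
  let current: SilentInterval | null = null;
  for (const shot of shots) {
    if (shot.transcript.trim() === "") {
      if (current && current.end === shot.start) {
        current.end = shot.end; // extend the open window
      } else {
        current = { start: shot.start, end: shot.end };
        intervals.push(current);
      }
    } else {
      current = null; // speech closes the window
    }
  }
  return intervals.filter(i => i.end - i.start >= minDuration);
}
```

A real implementation would also trim a small safety margin at each end so narration never brushes against dialogue.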
For each silent interval, VidBuddy calculates how many words fit at a natural narration rate (~3 words/second) and calls the rewrite model with structured instructions:
- Write in present tense with a warm, natural narrator voice
- Be specific and vivid β describe actions, expressions, positions
- Do not repeat what was already said in the previous description
- Do not explain meaning β only describe what is seen
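The word-budget arithmetic described above is simple; here is a sketch, using the ~3 words/second pace from the text. The function names and prompt wording are illustrative – the real instructions live in `Prompts.ts`:

```typescript
// A comfortable narration pace (~3 words/second, per the text above).
const WORDS_PER_SECOND = 3;

// How many words fit in one silent interval.
function wordBudget(startSec: number, endSec: number): number {
  return Math.floor((endSec - startSec) * WORDS_PER_SECOND);
}

// Illustrative prompt assembly for the rewrite model.
function buildRewritePrompt(rawDescription: string, budget: number): string {
  return [
    "Rewrite this scene description as audio-description narration.",
    "Use present tense and a warm, natural narrator voice.",
    "Describe only what is seen; do not interpret, and do not repeat earlier narration.",
    `Hard limit: ${budget} words.`,
    "",
    rawDescription,
  ].join("\n");
}
```

For example, a 4.5-second gap yields a budget of 13 words, which the model must not exceed.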
The creator reviews every timed segment in the studio before any audio is generated. They can:
- Edit any description for accuracy or style
- Delete irrelevant segments
- Add new ones for missed visual moments
- Reorder segments
This step exists because automated descriptions can be vague, redundant, or miss what matters for accessibility. A human reviewer catches those issues before the voice is synthesised.
Each reviewed description is synthesised into a .wav clip using Azure Neural TTS. Clips are stored in Azure Blob Storage alongside the video manifest.
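Azure Neural TTS accepts plain text or SSML. A minimal sketch of wrapping one reviewed description in SSML before synthesis (the helper and the voice name are assumptions for illustration, not the actual `TtsHelper.ts` code):

```typescript
// Wrap one reviewed description in SSML for Azure Neural TTS.
// The voice name is an assumption; any Azure neural voice works here.
function toSsml(text: string, voice = "en-US-JennyNeural"): string {
  // Escape the characters that would break the XML payload.
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}">${escaped}</voice>` +
    `</speak>`
  );
}
```

SSML is worth the extra step because it lets the creator later control pacing and voice per segment without touching the pipeline.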
The browser uses FFmpeg compiled to WebAssembly to mix the original video with the timed audio clips. The final described MP4 downloads directly from the browser β no backend render farm, no server costs.
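The mixing step boils down to a single FFmpeg invocation. A sketch of how the argument list might be assembled, assuming the clips were already written to ffmpeg.wasm's virtual filesystem – the helper, file names, and exact filter options (`adelay`'s `all`, `amix`'s `normalize`) are assumptions, not VidBuddy's actual command:

```typescript
// Illustrative shape of one narration clip already written to
// ffmpeg's in-memory filesystem.
interface NarrationClip {
  file: string;     // e.g. "clip0.wav"
  startSec: number; // where the clip begins on the video timeline
}

// Build the argument list for one mixing pass: delay each clip to its
// timeline position with adelay, then blend everything with amix.
function buildMixArgs(
  clips: NarrationClip[],
  input = "input.mp4",
  output = "output.mp4",
): string[] {
  const clipInputs = clips.flatMap(c => ["-i", c.file]);
  const delayed = clips
    .map((c, i) => `[${i + 1}:a]adelay=${Math.round(c.startSec * 1000)}:all=1[d${i}]`)
    .join(";");
  const mixPads = clips.map((_, i) => `[d${i}]`).join("");
  const filter =
    `${delayed};[0:a]${mixPads}amix=inputs=${clips.length + 1}:normalize=0[aout]`;
  return [
    "-i", input,               // original video (stream 0)
    ...clipInputs,             // narration clips (streams 1..n)
    "-filter_complex", filter,
    "-map", "0:v",             // keep the original video stream
    "-map", "[aout]",          // use the mixed audio
    "-c:v", "copy",            // no video re-encode
    output,
  ];
}
```

The resulting array would be handed to ffmpeg.wasm's run/exec call; because the video stream is copied rather than re-encoded, the in-browser export stays fast.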
Browser only – no backend server required

```
src/
├── LandingPage.tsx          Landing page, built-in demo, "Launch Studio" entry
├── App.tsx                  Root – routes between landing and studio views
├── VideoPlayer.tsx          Video playback, library list, download with AD
├── DescriptionTable.tsx     Timed segment editor (add / edit / delete / reorder)
├── UploadVideoDialog.tsx    Upload MP4 → trigger Azure Content Understanding
├── ProcessVideoDialog.tsx   Poll analysis → rewrite → TTS → load into player
├── MissingKeys.tsx          Warns when .env credentials are incomplete
│
├── helpers/
│   ├── ContentUnderstandingHelper.ts   Azure Content Understanding + LLM rewrite
│   ├── TtsHelper.ts                    Azure Speech SDK synthesis + audio preload
│   ├── BlobHelper.ts                   Azure Blob upload / list / delete
│   └── Prompts.ts                      Centralised AI prompt configuration
├── StateContext.tsx         Global React context (eliminates prop drilling)
└── Helper.ts                Time conversion utilities

docs/
└── SETUP_GUIDE.md           Detailed setup guide with screenshots
```
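The time conversion utilities noted above amount to mapping seconds to display timestamps. A rough sketch of one such helper (illustrative, not the actual `Helper.ts` code):

```typescript
// Convert seconds to an "HH:MM:SS.mmm" timestamp for the segment editor.
// Working in integer milliseconds avoids floating-point rounding artifacts.
function secondsToTimestamp(totalSeconds: number): string {
  const totalMs = Math.round(totalSeconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}
```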
| Standard | Requirement |
|---|---|
| WCAG 2.1 AA β SC 1.2.5 | Pre-recorded video must have audio description |
| WCAG 2.1 AAA β SC 1.2.7 | Extended audio description for complex visual moments |
| ADA Title II | US public sector video accessibility |
| Section 508 | US federal agency requirements |
| European Accessibility Act (EAA) | EU requirement effective June 2025 |
| AODA | Ontario, Canada accessibility standard |
```shell
npm install      # Install dependencies
npm run dev      # Start dev server at localhost:5173
npm run build    # Production build
npm run lint     # ESLint check
```

Studio loads but live processing doesn't start
- The warning banner in the studio lists every missing `.env` key – start there
- Verify your Blob SAS token hasn't expired (check the `se=` parameter)
- Confirm the blob container name is exactly `audio-description`
- Confirm CORS is enabled for `http://localhost:5173`
Azure Content Understanding returns an error
- Confirm `VITE_AI_SERVICES_REGION` matches your resource's region exactly
- Content Understanding is only available in `westus`, `swedencentral`, and `australiaeast`
Rewrite step produces empty descriptions
- Confirm `VITE_LLM_PROVIDER` matches the provider you configured (`azure`, `gemini`, or `kimi`)
- For Azure OpenAI: check the endpoint format (`https://your-resource.openai.azure.com`)
- For Gemini: verify the API key is active and the model name is a released version
TTS clips are silent or missing
- Your Azure AI Services key must have access to the Speech service
- Verify the SAS token allows `PUT` on the `audio-description` container
Full troubleshooting guide: docs/SETUP_GUIDE.md
VidBuddy is designed for hackathon and demonstration use.

- `VITE_*` variables are bundled into the browser JavaScript at build time – they are visible to anyone who inspects the page source
- For production: move secret handling to a server-side token proxy, and use short-lived SAS tokens
Built for the hackathon. If you want a live demo or have questions about running the full pipeline:
GitHub: @thesumedh
Live execution uses real Azure credits, so the demo keys are not published. Reach out and we will arrange a walkthrough.
- Azure Content Understanding: learn.microsoft.com/.../content-understanding
- WCAG 2.1 Audio Description: w3.org/WAI/WCAG21/quickref/#audio-description-prerecorded
- Azure Neural TTS: learn.microsoft.com/.../text-to-speech
- Azure OpenAI models: learn.microsoft.com/.../models
- Gemini API (budget alternative): ai.google.dev/gemini-api/docs
- Kimi K2.5 (alternative): platform.moonshot.ai
MIT Β© 2025 VidBuddy Contributors. See license.md.