300 million people worldwide are blind or have low vision. Most online video was built for sighted viewers.
When a teacher points at a whiteboard, a presenter switches slides, or a character reacts silently, that visual moment is invisible to anyone relying on audio. Existing captions only transcribe speech. They do nothing for the visual context that sighted viewers take for granted.
Audio Description fills that gap. A narrator quietly describes what is happening on screen during natural pauses in dialogue:
> "Sarah just walked in. She looks a bit tired, but she's smiling at you from the left corner."
Creating AD tracks manually is slow and expensive. VidBuddy automates the entire workflow, from raw MP4 to a fully described, accessible video, without a backend server.
Audio Description is a technique for describing what is happening during a video, to benefit audience members who are blind or have low vision. It generally takes the form of a second audio track, and is available on TV, streaming services, and at movie theaters.
The narration is timed to fit within silent parts of the video, so it does not overlap the dialogue and does not increase the length of the program, unlike pausing to provide a description, which would break the viewing experience.
Providing content with audio description tracks is a legal requirement in several countries (USA, EU, Canada, Australia), and demand for AD-compliant content is accelerating as regulations tighten.
VidBuddy leverages AI to assist the AD authoring process end to end:
- Scene analysis – Azure Content Understanding generates a visual description for each shot and transcribes all dialogue
- Silent gap detection – VidBuddy identifies windows where narration can be inserted without overlapping speech
- Description rewriting – Azure OpenAI (or Gemini/Kimi) rewrites raw descriptions to fit precisely within each silent window
- Human review – the AI-generated script is presented to an AD editor as a draft to review, correct, and approve before any audio is synthesised
- Voice synthesis + export – Azure Neural TTS renders the final narration, and WebAssembly FFmpeg mixes it into a downloadable described MP4
We believe that making the AD authoring process faster, and thus less expensive, will result in more inclusive content being created, benefiting the 300 million people worldwide who are blind or have low vision.
| Role | How they use VidBuddy |
|---|---|
| The viewer | A blind or low-vision person who receives the exported described video and can now follow it with full context |
| The creator | A teacher, journalist, media team, nonprofit, or content owner who runs the VidBuddy studio to generate, review, and export the described version |
| The evaluator | A hackathon judge or accessibility reviewer exploring the automated → human-review → export pipeline |
The viewer does not need to use the studio. The creator uses VidBuddy to produce a better version of the video for them.
```
Source MP4
     │
     ▼
[1] Azure Content Understanding
    Scene detection · Dialogue transcript · Silent gap identification
     │
     ▼
[2] Azure OpenAI (GPT-4o-mini) – recommended
    Write a natural narrator sentence for each silent gap
    Tighten each sentence to fit within the available time window
     │
     ▼
[3] VidBuddy Studio – Human Review
    Creator sees every timed segment
    Can edit, delete, reorder, or add descriptions before synthesis
     │
     ▼
[4] Azure Neural TTS
    Synthesize each reviewed description into a natural voice clip
     │
     ▼
[5] WebAssembly FFmpeg (in the browser)
    Mix voice clips into the original video timeline
    Export a described MP4 – no backend server required
```
The landing page includes a built-in walkthrough using two pre-rendered sample clips bundled with the repo:
- `demo_input.mp4` – the raw source clip
- `demo.mp4` – the same clip with audio descriptions added
Open the app, press Watch It Work, and the page animates through the pipeline before revealing the described output. No Azure keys or account needed.
To run the full live pipeline with your own videos:
- Clone the repo and run `npm install`
- Copy `.env.example` to `.env` and fill in your Azure credentials
- Run `npm run dev` – the studio becomes fully functional
💬 Want a live demo but don't have Azure credentials? Reach out on GitHub: @thesumedh. Live execution costs real Azure credits, so the demo credentials are not published here. Drop a message and we can arrange a live walkthrough.
```shell
npm install
npm run dev
```

Open http://localhost:5173. The landing page and built-in demo work immediately – no credentials needed.
Copy and rename the example file:

```shell
cp .env.example .env
```

Then open `.env` and fill in your values.
```
VITE_STORAGE_ACCOUNT=your-storage-account-name
VITE_BLOB_SAS_TOKEN=sp=...&st=...&se=...&sv=...&sr=c&sig=...
VITE_AI_SERVICES_RESOURCE=your-ai-services-resource-name
VITE_AI_SERVICES_KEY=your-ai-services-key
VITE_AI_SERVICES_REGION=westus
```

Azure OpenAI – recommended (keeps everything in one Azure account):
```
VITE_LLM_PROVIDER=azure
VITE_AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
VITE_AZURE_OPENAI_API_KEY=your-openai-key
VITE_AZURE_OPENAI_MODEL=gpt-4o-mini
```

Budget alternatives: if you don't have an Azure OpenAI deployment, you can swap in a different rewrite model. VidBuddy supports:
| Provider | Set `VITE_LLM_PROVIDER` to | Extra key needed | Notes |
|---|---|---|---|
| Azure OpenAI | `azure` | `VITE_AZURE_OPENAI_*` | Recommended – all-Azure, enterprise-grade |
| Google Gemini | `gemini` | `VITE_GEMINI_API_KEY` | Free tier available – good hackathon option |
| Kimi K2.5 | `kimi` | `VITE_KIMI_API_KEY` | Strong multimodal alternative from Moonshot AI |

The rest of the pipeline (storage, scene analysis, speech) always runs on Azure regardless of which rewrite model you pick.
| Resource | What it does | Required regions |
|---|---|---|
| Azure AI Services | Scene analysis (Content Understanding) + Text-to-Speech | westus · swedencentral · australiaeast |
| Azure Storage Account | Store uploaded videos and synthesized audio clips | Any region |
| Azure OpenAI deployment | Rewrite raw descriptions into natural narration | Any region with GPT-4o-mini available |
In Azure Portal β your Storage Account β Resource Sharing (CORS):
| Setting | Value |
|---|---|
| Allowed origins | http://localhost:5173 (add your deployed URL too) |
| Allowed methods | GET, PUT, DELETE, OPTIONS |
| Allowed headers | * |
| Max age | 86400 |
Container name must be exactly: `audio-description`
Full step-by-step setup with screenshots: docs/SETUP_GUIDE.md
Azure Content Understanding processes the video and returns one entry per video shot, including:
- All transcript phrases detected in the shot
- A GPT-generated visual description of what is happening in the frame
VidBuddy groups consecutive shots with no detected speech into silent intervals β windows where audio description can be inserted without talking over dialogue.
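That grouping step can be sketched roughly like this. The `Shot` shape and the helper are simplified assumptions for illustration, not VidBuddy's actual types:

```typescript
interface Shot {
  start: number;      // seconds
  end: number;
  transcript: string; // "" when no speech was detected in the shot
}

interface SilentInterval {
  start: number;
  end: number;
}

// Merge consecutive speech-free shots into one narration window,
// then drop windows too short to fit any narration.
function findSilentIntervals(shots: Shot[], minDuration = 1.0): SilentInterval[] {
  const intervals: SilentInterval[] = [];
  let current: SilentInterval | null = null;
  for (const shot of shots) {
    if (shot.transcript.trim() === "") {
      if (current && current.end === shot.start) {
        current.end = shot.end; // extend the open window
      } else {
        current = { start: shot.start, end: shot.end };
        intervals.push(current);
      }
    } else {
      current = null; // speech closes the window
    }
  }
  return intervals.filter(i => i.end - i.start >= minDuration);
}
```

A real implementation would also trim a small safety margin at each end so narration never brushes against dialogue.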
For each silent interval, VidBuddy calculates how many words fit at a natural narration rate (~3 words/second) and calls the rewrite model with structured instructions:
- Write in present tense with a warm, natural narrator voice
- Be specific and vivid β describe actions, expressions, positions
- Do not repeat what was already said in the previous description
- Do not explain meaning β only describe what is seen
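The word-budget arithmetic described above is simple; here is a sketch, using the ~3 words/second pace from the text. The function names and prompt wording are illustrative – the real instructions live in `Prompts.ts`:

```typescript
// A comfortable narration pace (~3 words/second, per the text above).
const WORDS_PER_SECOND = 3;

// How many words fit in one silent interval.
function wordBudget(startSec: number, endSec: number): number {
  return Math.floor((endSec - startSec) * WORDS_PER_SECOND);
}

// Illustrative prompt assembly for the rewrite model.
function buildRewritePrompt(rawDescription: string, budget: number): string {
  return [
    "Rewrite this scene description as audio-description narration.",
    "Use present tense and a warm, natural narrator voice.",
    "Describe only what is seen; do not interpret, and do not repeat earlier narration.",
    `Hard limit: ${budget} words.`,
    "",
    rawDescription,
  ].join("\n");
}
```

For example, a 4.5-second gap yields a budget of 13 words, which the model must not exceed.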
The creator reviews every timed segment in the studio before any audio is generated. They can:
- Edit any description for accuracy or style
- Delete irrelevant segments
- Add new ones for missed visual moments
- Reorder segments
This step exists because automated descriptions can be vague, redundant, or miss what matters for accessibility. A human reviewer catches those issues before the voice is synthesised.
Each reviewed description is synthesised into a .wav clip using Azure Neural TTS. Clips are stored in Azure Blob Storage alongside the video manifest.
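Azure Neural TTS accepts plain text or SSML. A minimal sketch of wrapping one reviewed description in SSML before synthesis (the helper and the voice name are assumptions for illustration, not the actual `TtsHelper.ts` code):

```typescript
// Wrap one reviewed description in SSML for Azure Neural TTS.
// The voice name is an assumption; any Azure neural voice works here.
function toSsml(text: string, voice = "en-US-JennyNeural"): string {
  // Escape the characters that would break the XML payload.
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}">${escaped}</voice>` +
    `</speak>`
  );
}
```

SSML is worth the extra step because it lets the creator later control pacing and voice per segment without touching the pipeline.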
The browser uses FFmpeg compiled to WebAssembly to mix the original video with the timed audio clips. The final described MP4 downloads directly from the browser β no backend render farm, no server costs.
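The mixing step boils down to a single FFmpeg invocation. A sketch of how the argument list might be assembled, assuming the clips were already written to ffmpeg.wasm's virtual filesystem – the helper, file names, and exact filter options (`adelay`'s `all`, `amix`'s `normalize`) are assumptions, not VidBuddy's actual command:

```typescript
// Illustrative shape of one narration clip already written to
// ffmpeg's in-memory filesystem.
interface NarrationClip {
  file: string;     // e.g. "clip0.wav"
  startSec: number; // where the clip begins on the video timeline
}

// Build the argument list for one mixing pass: delay each clip to its
// timeline position with adelay, then blend everything with amix.
function buildMixArgs(
  clips: NarrationClip[],
  input = "input.mp4",
  output = "output.mp4",
): string[] {
  const clipInputs = clips.flatMap(c => ["-i", c.file]);
  const delayed = clips
    .map((c, i) => `[${i + 1}:a]adelay=${Math.round(c.startSec * 1000)}:all=1[d${i}]`)
    .join(";");
  const mixPads = clips.map((_, i) => `[d${i}]`).join("");
  const filter =
    `${delayed};[0:a]${mixPads}amix=inputs=${clips.length + 1}:normalize=0[aout]`;
  return [
    "-i", input,               // original video (stream 0)
    ...clipInputs,             // narration clips (streams 1..n)
    "-filter_complex", filter,
    "-map", "0:v",             // keep the original video stream
    "-map", "[aout]",          // use the mixed audio
    "-c:v", "copy",            // no video re-encode
    output,
  ];
}
```

The resulting array would be handed to ffmpeg.wasm's run/exec call; because the video stream is copied rather than re-encoded, the in-browser export stays fast.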
Browser only – no backend server required

```
src/
├── LandingPage.tsx          Landing page, built-in demo, "Launch Studio" entry
├── App.tsx                  Root – routes between landing and studio views
├── VideoPlayer.tsx          Video playback, library list, download with AD
├── DescriptionTable.tsx     Timed segment editor (add / edit / delete / reorder)
├── UploadVideoDialog.tsx    Upload MP4 → trigger Azure Content Understanding
├── ProcessVideoDialog.tsx   Poll analysis → rewrite → TTS → load into player
├── MissingKeys.tsx          Warns when .env credentials are incomplete
│
├── helpers/
│   ├── ContentUnderstandingHelper.ts   Azure Content Understanding + LLM rewrite
│   ├── TtsHelper.ts                    Azure Speech SDK synthesis + audio preload
│   ├── BlobHelper.ts                   Azure Blob upload / list / delete
│   └── Prompts.ts                      Centralised AI prompt configuration
├── StateContext.tsx         Global React context (eliminates prop drilling)
└── Helper.ts                Time conversion utilities

docs/
└── SETUP_GUIDE.md           Detailed setup guide with screenshots
```
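The time conversion utilities noted above amount to mapping seconds to display timestamps. A rough sketch of one such helper (illustrative, not the actual `Helper.ts` code):

```typescript
// Convert seconds to an "HH:MM:SS.mmm" timestamp for the segment editor.
// Working in integer milliseconds avoids floating-point rounding artifacts.
function secondsToTimestamp(totalSeconds: number): string {
  const totalMs = Math.round(totalSeconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}
```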
| Standard | Requirement |
|---|---|
| WCAG 2.1 AA β SC 1.2.5 | Pre-recorded video must have audio description |
| WCAG 2.1 AAA β SC 1.2.7 | Extended audio description for complex visual moments |
| ADA Title II | US public sector video accessibility |
| Section 508 | US federal agency requirements |
| European Accessibility Act (EAA) | EU requirement effective June 2025 |
| AODA | Ontario, Canada accessibility standard |
```shell
npm install      # Install dependencies
npm run dev      # Start dev server at localhost:5173
npm run build    # Production build
npm run lint     # ESLint check
```

Studio loads but live processing doesn't start
- The warning banner in the studio lists every missing `.env` key – start there
- Verify your Blob SAS token hasn't expired (check the `se=` parameter)
- Confirm the blob container name is exactly `audio-description`
- Confirm CORS is enabled for `http://localhost:5173`
Azure Content Understanding returns an error
- Confirm `VITE_AI_SERVICES_REGION` matches your resource's region exactly
- Content Understanding is only available in `westus`, `swedencentral`, and `australiaeast`
Rewrite step produces empty descriptions
- Confirm `VITE_LLM_PROVIDER` matches the provider you configured (`azure`, `gemini`, or `kimi`)
- For Azure OpenAI: check the endpoint format (`https://your-resource.openai.azure.com`)
- For Gemini: verify the API key is active and the model name is a released version
TTS clips are silent or missing
- Your Azure AI Services key must have access to the Speech service
- Verify the SAS token allows `PUT` on the `audio-description` container
Full troubleshooting guide: docs/SETUP_GUIDE.md
VidBuddy is designed for hackathon and demonstration use.

- `VITE_*` variables are bundled into the browser JavaScript at build time – they are visible to anyone who inspects the page source
- For production: move secret handling to a server-side token proxy, and use short-lived SAS tokens
Built for the hackathon. If you want a live demo or have questions about running the full pipeline:
GitHub: @thesumedh
Live execution uses real Azure credits, so the demo keys are not published. Reach out and we will arrange a walkthrough.
- Azure Content Understanding: learn.microsoft.com/.../content-understanding
- WCAG 2.1 Audio Description: w3.org/WAI/WCAG21/quickref/#audio-description-prerecorded
- Azure Neural TTS: learn.microsoft.com/.../text-to-speech
- Azure OpenAI models: learn.microsoft.com/.../models
- Gemini API (budget alternative): ai.google.dev/gemini-api/docs
- Kimi K2.5 (alternative): platform.moonshot.ai
MIT Β© 2025 VidBuddy Contributors. See license.md.