VidBuddy – AI Audio Descriptions for Blind & Low-Vision Viewers

Azure AI · Azure OpenAI · React · TypeScript · WCAG 2.1 AA · License: MIT


The Problem

300 million people worldwide are blind or have low vision. Most online video was built for sighted viewers.

When a teacher points at a whiteboard, a presenter switches slides, or a character reacts silently – that visual moment is invisible to anyone relying on audio. Existing captions only transcribe speech. They do nothing for the visual context that sighted viewers take for granted.

Audio Description fills that gap. A narrator quietly describes what is happening on screen during natural pauses in dialogue:

🎙 "Sarah just walked in. She looks a bit tired, but she's smiling at you from the left corner."

Creating AD tracks manually is slow and expensive. VidBuddy automates the entire workflow – from raw MP4 to a fully described, accessible video – without a backend server.


What is Audio Description?

Audio Description is a technique for describing what is happening during a video, to benefit audience members who are blind or have low vision. It generally takes the form of a second audio track, and is available on TV, streaming services, and at movie theaters.

The narration is timed to fit within silent parts of the video, so it does not overlap the dialogue and does not increase the length of the programme – unlike pausing to provide a description, which would break the viewing experience.

Providing content with audio description tracks is a legal requirement in several jurisdictions (the USA, EU, Canada, and Australia), and demand for AD-compliant content is accelerating as regulations tighten.

How VidBuddy Uses AI to Help

VidBuddy leverages AI to assist the AD authoring process end to end:

  1. Scene analysis – Azure Content Understanding generates a visual description for each shot and transcribes all dialogue
  2. Silent gap detection – VidBuddy identifies windows where narration can be inserted without overlapping speech
  3. Description rewriting – Azure OpenAI (or Gemini/Kimi) rewrites raw descriptions to fit precisely within each silent window
  4. Human review – the AI-generated script is presented to an AD editor as a draft to review, correct, and approve before any audio is synthesised
  5. Voice synthesis + export – Azure Neural TTS renders the final narration, and WebAssembly FFmpeg mixes it into a downloadable described MP4

We believe that making the AD authoring process faster, and thus less expensive, will result in more inclusive content being created β€” benefiting the 300 million people worldwide who are blind or have low vision.


Who This Is For

| Role | How they use VidBuddy |
| --- | --- |
| The viewer | A blind or low-vision person who receives the exported described video and can now follow it with full context |
| The creator | A teacher, journalist, media team, nonprofit, or content owner who runs the VidBuddy studio to generate, review, and export the described version |
| The evaluator | A hackathon judge or accessibility reviewer exploring the automated → human-review → export pipeline |

The viewer does not need to use the studio. The creator uses VidBuddy to produce a better version of the video for them.


How It Works

Source MP4
    │
    ▼
[1] Azure Content Understanding
    Scene detection · Dialogue transcript · Silent gap identification
    │
    ▼
[2] Azure OpenAI (GPT-4o-mini)          ← recommended
    Write a natural narrator sentence for each silent gap
    Tighten each sentence to fit within the available time window
    │
    ▼
[3] VidBuddy Studio – Human Review
    Creator sees every timed segment
    Can edit, delete, reorder, or add descriptions before synthesis
    │
    ▼
[4] Azure Neural TTS
    Synthesize each reviewed description into a natural voice clip
    │
    ▼
[5] WebAssembly FFmpeg (in the browser)
    Mix voice clips into the original video timeline
    Export a described MP4 – no backend server required

Demo (No Credentials Needed)

The landing page includes a built-in walkthrough using two pre-rendered sample clips bundled with the repo:

  • demo_input.mp4 – the raw source clip
  • demo.mp4 – the same clip with audio descriptions added

Open the app, press Watch It Work, and the page animates through the pipeline before revealing the described output. No Azure keys or account needed.


Live Processing (Bring Your Own Credentials)

To run the full live pipeline with your own videos:

  1. Clone the repo and run npm install
  2. Copy .env.example to .env and fill in your Azure credentials
  3. Run npm run dev – the studio becomes fully functional

💬 Want a live demo but don't have Azure credentials? Reach out on GitHub: @thesumedh. Live execution costs real Azure credits, so the demo credentials are not published here. Drop a message and we can arrange a live walkthrough.


Quick Start

npm install
npm run dev

Open http://localhost:5173. The landing page and built-in demo work immediately – no credentials needed.


Setting Up Credentials

Copy and rename the example file:

cp .env.example .env

Then open .env and fill in your values.

Required for all live processing

VITE_STORAGE_ACCOUNT=your-storage-account-name
VITE_BLOB_SAS_TOKEN=sp=...&st=...&se=...&sv=...&sr=c&sig=...
VITE_AI_SERVICES_RESOURCE=your-ai-services-resource-name
VITE_AI_SERVICES_KEY=your-ai-services-key
VITE_AI_SERVICES_REGION=westus

Rewrite model (choose one)

Azure OpenAI – recommended (keeps everything in one Azure account):

VITE_LLM_PROVIDER=azure
VITE_AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
VITE_AZURE_OPENAI_API_KEY=your-openai-key
VITE_AZURE_OPENAI_MODEL=gpt-4o-mini

Budget alternatives: If you don't have an Azure OpenAI deployment, you can swap to a different rewrite model. VidBuddy supports:

| Provider | Set VITE_LLM_PROVIDER to | Extra key needed | Notes |
| --- | --- | --- | --- |
| Azure OpenAI ⭐ | azure | VITE_AZURE_OPENAI_* | Recommended – all-Azure, enterprise-grade |
| Google Gemini | gemini | VITE_GEMINI_API_KEY | Free tier available – good hackathon option |
| Kimi K2.5 | kimi | VITE_KIMI_API_KEY | Strong multimodal alternative from Moonshot AI |

The rest of the pipeline (storage, scene analysis, speech) always runs on Azure regardless of which rewrite model you pick.
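The provider-to-key mapping in the table above can be checked mechanically. A minimal sketch of such a check follows; the real validation lives in MissingKeys.tsx and may be structured differently:

```typescript
// Which extra env key each rewrite provider needs, per the table above.
const REQUIRED_KEY: Record<string, string> = {
  azure: "VITE_AZURE_OPENAI_API_KEY",
  gemini: "VITE_GEMINI_API_KEY",
  kimi: "VITE_KIMI_API_KEY",
};

// Returns the name of the missing key, or null if the config looks complete.
function missingRewriteKey(env: Record<string, string | undefined>): string | null {
  const provider = env.VITE_LLM_PROVIDER ?? "azure"; // azure is the default
  const key = REQUIRED_KEY[provider];
  return key && !env[key] ? key : null;
}
```

Running this against an env object that sets only VITE_LLM_PROVIDER=gemini would report VITE_GEMINI_API_KEY as missing.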


Azure Resource Setup

| Resource | What it does | Required regions |
| --- | --- | --- |
| Azure AI Services | Scene analysis (Content Understanding) + Text-to-Speech | westus · swedencentral · australiaeast |
| Azure Storage Account | Store uploaded videos and synthesized audio clips | Any region |
| Azure OpenAI deployment | Rewrite raw descriptions into natural narration | Any region with GPT-4o-mini available |

Storage CORS (required for browser uploads)

In Azure Portal → your Storage Account → Resource Sharing (CORS):

| Setting | Value |
| --- | --- |
| Allowed origins | http://localhost:5173 (add your deployed URL too) |
| Allowed methods | GET, PUT, DELETE, OPTIONS |
| Allowed headers | * |
| Max age | 86400 |

Container name must be exactly: audio-description
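The browser talks to blobs in that container via SAS-signed URLs built from the env values above. A minimal sketch of the URL assembly, assuming the standard blob.core.windows.net host layout (whether BlobHelper.ts builds URLs exactly this way is an assumption):

```typescript
// Assemble a SAS-authenticated blob URL from the .env values.
function blobUrl(account: string, blobName: string, sasToken: string): string {
  const container = "audio-description"; // name must match exactly
  return (
    `https://${account}.blob.core.windows.net/${container}/` +
    `${encodeURIComponent(blobName)}?${sasToken}`
  );
}
```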

Full step-by-step setup with screenshots: docs/SETUP_GUIDE.md


How the AI Pipeline Works in Detail

Step 1 – Scene Analysis (Azure Content Understanding)

Azure Content Understanding processes the video and returns one entry per video shot, including:

  • All transcript phrases detected in the shot
  • A GPT-generated visual description of what is happening in the frame

Step 2 – Silent Gap Detection

VidBuddy groups consecutive shots with no detected speech into silent intervals – windows where audio description can be inserted without talking over dialogue.
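The grouping step can be sketched as follows. The shot shape here is an illustrative assumption, not the actual Content Understanding response schema:

```typescript
// Assumed shape of an analyzed shot: a time range plus any dialogue phrases.
interface Shot { startSec: number; endSec: number; transcript: string[] }
interface Gap { startSec: number; endSec: number }

// Merge consecutive speech-free shots into single silent intervals.
function findSilentGaps(shots: Shot[]): Gap[] {
  const gaps: Gap[] = [];
  for (const shot of shots) {
    if (shot.transcript.length > 0) continue; // shot contains dialogue
    const last = gaps[gaps.length - 1];
    if (last && last.endSec === shot.startSec) {
      last.endSec = shot.endSec; // shot is contiguous: extend the open gap
    } else {
      gaps.push({ startSec: shot.startSec, endSec: shot.endSec });
    }
  }
  return gaps;
}
```

Two back-to-back silent shots become one narration window; a shot with dialogue in between splits the windows apart.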

Step 3 – Description Rewriting

For each silent interval, VidBuddy calculates how many words fit at a natural narration rate (~3 words/second) and calls the rewrite model with structured instructions:

  • Write in present tense with a warm, natural narrator voice
  • Be specific and vivid – describe actions, expressions, positions
  • Do not repeat what was already said in the previous description
  • Do not explain meaning – only describe what is seen
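The word-budget arithmetic behind the rewrite call is simple. A sketch at the ~3 words/second rate quoted above (clampToBudget is a hypothetical fallback, not necessarily present in the codebase; normally the rewrite model is asked to shorten gracefully):

```typescript
// Natural narration rate cited above: roughly 3 words per second.
const WORDS_PER_SECOND = 3;

// How many words fit in a silent interval of the given length.
function wordBudget(gapSeconds: number): number {
  return Math.floor(gapSeconds * WORDS_PER_SECOND);
}

// Hypothetical last-resort trim if a draft still exceeds its budget.
function clampToBudget(draft: string, gapSeconds: number): string {
  const words = draft.split(/\s+/).filter(Boolean);
  return words.slice(0, wordBudget(gapSeconds)).join(" ");
}
```

A 4-second gap yields a 12-word budget, which goes into the structured instructions sent to the model.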

Step 4 – Human Review

The creator reviews every timed segment in the studio before any audio is generated. They can:

  • Edit any description for accuracy or style
  • Delete irrelevant segments
  • Add new ones for missed visual moments
  • Reorder segments

This step exists because automated descriptions can be vague, redundant, or miss what matters for accessibility. A human reviewer catches those issues before the voice is synthesised.

Step 5 – Neural Voice Synthesis

Each reviewed description is synthesised into a .wav clip using Azure Neural TTS. Clips are stored in Azure Blob Storage alongside the video manifest.
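Azure Neural TTS accepts SSML markup. A sketch of the document each reviewed description could be wrapped in before synthesis (the voice name is an assumption; any Azure Neural voice can be substituted):

```typescript
// Wrap one reviewed description in minimal SSML for the Speech service.
function buildSsml(text: string, voice = "en-US-JennyNeural"): string {
  const escaped = text
    .replace(/&/g, "&amp;") // escape & first so the entities below survive
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">` +
    `<voice name="${voice}">${escaped}</voice></speak>`
  );
}
```

The Speech SDK then synthesises that SSML into the .wav clip that gets uploaded to Blob Storage.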

Step 6 – In-Browser Video Export

The browser uses FFmpeg compiled to WebAssembly to mix the original video with the timed audio clips. The final described MP4 downloads directly from the browser – no backend render farm, no server costs.
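The mix boils down to one FFmpeg invocation: each narration clip is delayed to its timestamp with adelay, then layered over the original soundtrack with amix. A sketch of how the argument list could be assembled (filenames and the exact filter graph are illustrative, not copied from the repo):

```typescript
// One synthesised narration clip and the time it should start playing.
interface TimedClip { file: string; startSec: number }

function buildMixArgs(video: string, clips: TimedClip[], out: string): string[] {
  if (clips.length === 0) return ["-i", video, "-c", "copy", out]; // nothing to mix
  const inputs = clips.flatMap((c) => ["-i", c.file]);
  // e.g. [1:a]adelay=5000|5000[d0] -- delay applied to both stereo channels
  const delays = clips
    .map((c, i) => {
      const ms = Math.round(c.startSec * 1000);
      return `[${i + 1}:a]adelay=${ms}|${ms}[d${i}]`;
    })
    .join(";");
  const mixIn = ["[0:a]", ...clips.map((_, i) => `[d${i}]`)].join("");
  const filter = `${delays};${mixIn}amix=inputs=${clips.length + 1}:duration=first[aout]`;
  return [
    "-i", video, ...inputs,
    "-filter_complex", filter,
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "copy", out, // copy the video stream; only audio is re-encoded
  ];
}
```

With ffmpeg.wasm, an array like this would be passed to the exec call after writing the video and clips into its in-memory filesystem.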


Architecture

Browser only – no backend server required
│
├── LandingPage.tsx          Landing page, built-in demo, "Launch Studio" entry
├── App.tsx                  Root – routes between landing and studio views
├── VideoPlayer.tsx          Video playback, library list, download with AD
├── DescriptionTable.tsx     Timed segment editor (add / edit / delete / reorder)
├── UploadVideoDialog.tsx    Upload MP4 → trigger Azure Content Understanding
├── ProcessVideoDialog.tsx   Poll analysis → rewrite → TTS → load into player
├── MissingKeys.tsx          Warns when .env credentials are incomplete
│
└── helpers/
    ├── ContentUnderstandingHelper.ts  Azure Content Understanding + LLM rewrite
    ├── TtsHelper.ts                   Azure Speech SDK synthesis + audio preload
    ├── BlobHelper.ts                  Azure Blob upload / list / delete
    ├── Prompts.ts                     Centralised AI prompt configuration
    ├── StateContext.tsx               Global React context (eliminates prop drilling)
    └── Helper.ts                      Time conversion utilities

docs/
└── SETUP_GUIDE.md           Detailed setup guide with screenshots

Accessibility Standards Addressed

| Standard | Requirement |
| --- | --- |
| WCAG 2.1 AA – SC 1.2.5 | Pre-recorded video must have audio description |
| WCAG 2.1 AAA – SC 1.2.7 | Extended audio description for complex visual moments |
| ADA Title II | US public sector video accessibility |
| Section 508 | US federal agency requirements |
| European Accessibility Act (EAA) | EU requirement effective June 2025 |
| AODA | Ontario, Canada accessibility standard |

Commands

npm install       # Install dependencies
npm run dev       # Start dev server at localhost:5173
npm run build     # Production build
npm run lint      # ESLint check

Troubleshooting

Studio loads but live processing doesn't start

  • The warning banner in the studio lists every missing .env key – start there
  • Verify your Blob SAS token hasn't expired (check the se= parameter)
  • Confirm the blob container name is exactly audio-description
  • Confirm CORS is enabled for http://localhost:5173

Azure Content Understanding returns an error

  • Confirm VITE_AI_SERVICES_REGION matches your resource's region exactly
  • Content Understanding is only available in westus, swedencentral, and australiaeast

Rewrite step produces empty descriptions

  • Confirm VITE_LLM_PROVIDER matches the provider you configured (azure, gemini, or kimi)
  • For Azure OpenAI: check the endpoint format (https://your-resource.openai.azure.com)
  • For Gemini: verify the API key is active and the model name is a released version

TTS clips are silent or missing

  • Your Azure AI Services key must have access to the Speech service
  • Verify the SAS token allows PUT on the audio-description container

Full troubleshooting guide: docs/SETUP_GUIDE.md


Security

VidBuddy is designed for hackathon and demonstration use.

  • VITE_* variables are bundled into the browser JavaScript at build time – they are visible to anyone who inspects the page source
  • For production: move secret handling to a server-side token proxy, and use short-lived SAS tokens

Contact

Built for the hackathon. If you want a live demo or have questions about running the full pipeline:

GitHub: @thesumedh

Live execution uses real Azure credits, so the demo keys are not published. Reach out and we will arrange a walkthrough.


License

MIT © 2025 VidBuddy Contributors. See license.md.
