Turn text into a narrated visual story: scenes, images, and audio — in seconds.
🏆 Built at the Good Vibes Only AI/ML Buildathon @ USC (2025)
🏆 1st place (Context Engineering), Software Engineering Award
https://visurai-story-maker.lovable.app/
Project Demo: https://drive.google.com/file/d/16_YFVfVJoDPQqLkXXaRXSv_Dyr98bxey/view?usp=sharing
Visurai helps dyslexic and visual learners comprehend material by converting text into a sequence of AI-generated images with optional narration.
Paste any text and get:
- A title and segmented scenes that preserve key facts and names
- High-quality images per scene (Flux via Replicate or OpenAI gpt-image-1)
- Per‑scene TTS audio and a single merged audio track with a timeline
- Optional OCR to start from an image instead of text
- Context‑aware scene segmentation and detail‑preserving visual prompts (GPT‑4o)
- Image generation providers:
  - Replicate: Flux 1.1 Pro (default), 16:9 targeting with aspect-ratio/size fallbacks
  - OpenAI: gpt-image-1 with supported sizes and automatic fallback
- Narration:
  - Per-scene TTS (OpenAI gpt-4o-mini-tts)
  - Single merged MP3 with timestamps (ffmpeg concat demuxer)
- Live progress via SSE (/generate_visuals_events)
- OCR routes: generate from image URL or upload
- Absolute asset URLs using PUBLIC_BASE_URL (e.g., ngrok) for frontend access
Text / Image → OCR (optional)
↓
Scene segmentation (GPT‑4o)
↓
Detail‑preserving visual prompts
↓
Image generation (Replicate Flux or OpenAI gpt‑image‑1)
↓
TTS per scene → ffmpeg concat → single audio + timeline
↓
Frontend (React) consumes JSON, images, audio, and SSE
good-vibes-only/
├── backend/
│   ├── main.py          # FastAPI app (SSE, OCR, TTS, visuals)
│   ├── image_gen.py     # Image provider adapters (Replicate/OpenAI)
│   ├── tts.py           # OpenAI TTS + ffmpeg merge
│   ├── settings.py      # Pydantic settings + .env loader
│   ├── pyproject.toml   # Backend deps (use uv/pip)
│   └── uv.lock
└── frontend/            # React app that calls the backend
- Python 3.10+ (tested up to 3.13)
- ffmpeg installed (required for merged audio)
  - macOS: brew install ffmpeg
- Provider keys as needed:
  - Replicate: REPLICATE_API_TOKEN
  - OpenAI: OPENAI_API_KEY
From the repo root:

# 1) Install deps (using uv)
cd backend && uv sync && cd ..

# 2) Create backend/.env with your keys and config (see below)

# 3) Run the API from the repo root
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload

# LLM
OPENAI_API_KEY=sk-...
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
# Image provider (replicate | openai)
IMAGE_PROVIDER=replicate
REPLICATE_API_TOKEN=r8_...
REPLICATE_MODEL=black-forest-labs/flux-1.1-pro
REPLICATE_ASPECT_RATIO=16:9
# OpenAI Images (if IMAGE_PROVIDER=openai)
OPENAI_IMAGE_MODEL=gpt-image-1
OPENAI_IMAGE_SIZE=1536x1024 # allowed: 1024x1024, 1024x1536, 1536x1024, auto
# TTS
TTS_PROVIDER=openai
TTS_MODEL=gpt-4o-mini-tts
TTS_VOICE=alloy
TTS_OUTPUT_DIR=/tmp/seequence_audio
# Absolute URLs for frontend (ngrok/domain)
PUBLIC_BASE_URL=https://<your-ngrok-subdomain>.ngrok-free.dev
# CORS (optional – include your frontend origin when using credentials)
CORS_ORIGINS=https://<your-ngrok-subdomain>.ngrok-free.dev
# Health
curl -sS http://127.0.0.1:8000/health
# One image (provider-dependent)
curl -sS http://127.0.0.1:8000/generate_image \
-H "Content-Type: application/json" \
-d '{
"prompt": "Clean educational infographic showing 1 AU ≈ 1.496e8 km. Label Earth and Sun. High contrast."
}'
# Visuals + merged audio
curl -sS http://127.0.0.1:8000/generate_visuals_single_audio \
-H "Content-Type: application/json" \
-d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'Configure your frontend to call the backend base URL (e.g., PUBLIC_BASE_URL).
Typical React workflow:
cd frontend
pnpm install
pnpm dev

Ensure your frontend uses absolute URLs from the backend responses (e.g., image_url, audio_url); they already include PUBLIC_BASE_URL when set. A short consumption sketch follows below.
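As an illustration, here is a minimal TypeScript sketch (browser environment assumed; API_BASE, loadStory, and playStory are placeholder names, not part of the project) that calls /generate_visuals_single_audio and uses the returned absolute image_url/audio_url plus the timeline:

```ts
// Minimal sketch: call the backend and use the absolute URLs it returns.
// API_BASE is a placeholder; with Vite you could read it from import.meta.env.VITE_API_BASE.
const API_BASE = "http://127.0.0.1:8000";

interface TimelineEntry { scene_id: number; start_sec: number; duration_sec: number }
interface StoryResponse {
  title?: string;
  scenes: Array<{ scene_id: number; scene_summary: string; image_url: string }>;
  audio_url: string;        // merged narration MP3 (absolute when PUBLIC_BASE_URL is set)
  duration_seconds: number;
  timeline: TimelineEntry[];
}

async function loadStory(text: string): Promise<StoryResponse> {
  const res = await fetch(`${API_BASE}/generate_visuals_single_audio`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, max_scenes: 5 }),
  });
  if (!res.ok) throw new Error(`Backend error: ${res.status}`);
  return res.json();
}

// Example: play the merged narration and log which scene is currently active.
async function playStory(text: string) {
  const story = await loadStory(text);
  const audio = new Audio(story.audio_url);
  audio.ontimeupdate = () => {
    const current = story.timeline.find(
      (t) => audio.currentTime >= t.start_sec && audio.currentTime < t.start_sec + t.duration_sec
    );
    if (current) console.log("Active scene:", current.scene_id);
  };
  await audio.play();
}
```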
If your frontend needs an explicit base URL, set it (e.g., Vite):
# .env.local in frontend (example)
VITE_API_BASE=https://<your-ngrok-subdomain>.ngrok-free.dev

The backend can run either:
- Imperative flow (default): sequential segmentation → prompts → images
- LangGraph flow: graph-based orchestration
Enable LangGraph by setting an env var and restarting the server:
export PIPELINE_ENGINE=langgraph
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload

Endpoints are the same (e.g., POST /generate_visuals), but execution uses the graph.
- POST /generate_visuals → scenes with image URLs and a title
- POST /generate_visuals_with_audio → scenes + per-scene audio URLs + durations
- POST /generate_visuals_single_audio → merged audio_url, total duration, timeline, scenes
- GET /generate_visuals_events → Server-Sent Events stream for progress
- POST /visuals_from_image_url and /visuals_from_image_upload → OCR then visuals
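For orientation, the scene and timeline objects these endpoints return can be sketched in TypeScript roughly as follows; the field names mirror the detailed reference below, while the exact types and optionality markers are assumptions:

```ts
// Sketch of the response shapes documented in the reference below.
// Field names come from the reference; exact types/optionality are assumptions.
interface Scene {
  scene_id: number;
  scene_summary: string;
  prompt: string;
  image_url: string;
  source_sentence_indices?: number[];
  source_sentences?: string[];
  audio_url?: string;              // only on /generate_visuals_with_audio
  audio_duration_seconds?: number; // only on /generate_visuals_with_audio
}

interface TimelineEntry {
  scene_id: number;
  start_sec: number;
  duration_sec: number;
}

// Shape of POST /generate_visuals_single_audio responses
interface SingleAudioResponse {
  title?: string;
  scenes: Scene[];
  audio_url: string;        // merged MP3 (absolute when PUBLIC_BASE_URL is set)
  duration_seconds: number;
  timeline: TimelineEntry[];
}
```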
Below is a concise reference of the backend API: inputs, outputs, and example usage. All JSON responses use UTF‑8 and stable keys. When PUBLIC_BASE_URL is configured, media paths (e.g., /static/...) are returned as absolute URLs.
- GET /health
  - Response: { "status": "ok" }
- POST /segment
  - Body: { text: string, max_scenes?: number }
  - Response: { scenes: Array<{ scene_id: number, scene_summary: string, source_sentence_indices?: number[], source_sentences?: string[] }> }
- POST /generate_image
  - Body: { prompt: string, seed?: number }
  - Response: { image_url: string } (absolute if PUBLIC_BASE_URL is set)
- POST /generate_visuals
  - Body: { text: string, max_scenes?: number }
  - Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }> }
- POST /generate_visuals_with_audio
  - Body: { text: string, max_scenes?: number }
  - Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences?, audio_url, audio_duration_seconds }> }
  - Notes: audio_url points to /static/audio/...; duration is seconds (float).
- POST /generate_visuals_single_audio
  - Body: { text: string, max_scenes?: number }
  - Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }>, audio_url: string, duration_seconds: number, timeline: Array<{ scene_id: number, start_sec: number, duration_sec: number }> }
  - Notes: merged MP3 generated with ffmpeg concat demuxer.
- GET /generate_visuals_events?text=...&max_scenes=8
  - Event stream content-type: text/event-stream
  - Events (event name → data JSON):
    - started → { message: "begin" }
    - segmented → { count: number }
    - summarized → { has_summary: boolean }
    - prompt → { scene_id, prompt }
    - image:started → { scene_id }
    - image:done → { scene_id, image_url }
    - complete → { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }> }
  - Tip: With ngrok, append ?ngrok-skip-browser-warning=true to the URL to avoid the interstitial for EventSource. (A client-side consumption sketch follows after this reference.)
- GET /generate_visuals_single_audio_events?text=...&max_scenes=8
  - Events: started, segmented, summarized, prompt, image:started, image:done
    - tts:started → { scene_id }
    - tts:done → { scene_id, audio_url, duration_sec }
    - tts:merge_started → { count }
    - tts:merge_done → { audio_url, duration_seconds, timeline }
    - complete → { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }>, audio_url, duration_seconds, timeline }
- POST /ocr_from_image_url
  - Body: { image_url: string, prompt_hint?: string }
  - Response: { extracted_text: string }
- POST /ocr_from_image_upload (multipart)
  - Form: file=@path/to/img, content-type multipart/form-data
  - Response: { extracted_text: string }
- POST /visuals_from_image_url
  - Body: { image_url: string, max_scenes?: number, prompt_hint?: string }
  - Response: { scenes: [...] } (same shape as /generate_visuals)
- POST /visuals_from_image_upload (multipart)
  - Form: file=@path, max_scenes?
  - Response: { extracted_text: string, result: { title?: string, scenes: [...] } }
- GET /tts/diag → { mutagen: boolean, tinytag: boolean, tts_output_dir: string }
- GET /tts/duration?file=<name> → { file, exists, duration, size }
- GET /debug/audio_info?file=<name> → { file, exists, size, mtime, path, public_url, public_base_url }
- GET /debug/audios → { count, items: Array<{ file, size, mtime, url }> } (latest first)
- GET /debug/images → { count, items: Array<{ file, size, mtime, url }> } (latest first)
- GET /debug/storage → summary of media dirs, counts, and public mounts
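For browser clients, here is a minimal sketch of consuming the /generate_visuals_events stream with the standard EventSource API; the event names and payload fields follow the reference above, while API_BASE and watchProgress are placeholder names:

```ts
// Minimal sketch: listen to the SSE progress stream from the browser.
// Event names and payload fields follow the endpoint reference above.
const API_BASE = "http://127.0.0.1:8000"; // or your PUBLIC_BASE_URL

function watchProgress(text: string, maxScenes = 5): EventSource {
  const url =
    `${API_BASE}/generate_visuals_events` +
    `?text=${encodeURIComponent(text)}&max_scenes=${maxScenes}` +
    `&ngrok-skip-browser-warning=true`; // only needed behind ngrok

  const es = new EventSource(url);

  es.addEventListener("segmented", (e) => {
    const { count } = JSON.parse((e as MessageEvent).data);
    console.log(`Segmented into ${count} scenes`);
  });

  es.addEventListener("image:done", (e) => {
    const { scene_id, image_url } = JSON.parse((e as MessageEvent).data);
    console.log(`Scene ${scene_id} image ready:`, image_url);
  });

  es.addEventListener("complete", (e) => {
    const result = JSON.parse((e as MessageEvent).data);
    console.log("Done:", result.title, result.scenes.length, "scenes");
    es.close();
  });

  es.onerror = () => es.close(); // stop on stream end/error
  return es;
}
```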
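Similarly, a minimal sketch of uploading a local image to /visuals_from_image_upload from the browser with FormData; the form field names follow the reference above, and visualsFromUpload is a placeholder name:

```ts
// Minimal sketch: upload a local image for OCR + visuals from the browser.
// Field names ("file", "max_scenes") follow the endpoint reference above.
const API_BASE = "http://127.0.0.1:8000";

async function visualsFromUpload(file: File, maxScenes = 5) {
  const form = new FormData();
  form.append("file", file);
  form.append("max_scenes", String(maxScenes));

  const res = await fetch(`${API_BASE}/visuals_from_image_upload`, {
    method: "POST",
    body: form, // the browser sets the multipart boundary automatically
  });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json(); // { extracted_text, result: { title?, scenes: [...] } }
}
```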
The snippets below assume the API is running at http://127.0.0.1:8000. If you exposed it via ngrok, replace with your PUBLIC_BASE_URL.
curl -sS http://127.0.0.1:8000/generate_image \
-H "Content-Type: application/json" \
-d '{
"prompt": "Clean educational infographic showing 1 AU ≈ 1.496e8 km. Label Earth and Sun. High contrast.",
"seed": 42
}'

curl -sS http://127.0.0.1:8000/generate_visuals \
-H "Content-Type: application/json" \
-d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

curl -sS http://127.0.0.1:8000/generate_visuals_with_audio \
-H "Content-Type: application/json" \
-d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

curl -sS http://127.0.0.1:8000/generate_visuals_single_audio \
-H "Content-Type: application/json" \
-d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

# -N disables buffering to stream events as they arrive
curl -N "http://127.0.0.1:8000/generate_visuals_events?text=The%20Sun%20is%20a%20G-type%20star...&max_scenes=5"

curl -N "http://127.0.0.1:8000/generate_visuals_single_audio_events?text=The%20Sun%20is%20a%20G-type%20star...&max_scenes=5"

# From a public image URL
curl -sS http://127.0.0.1:8000/ocr_from_image_url \
-H "Content-Type: application/json" \
-d '{ "image_url": "https://example.com/page.png", "prompt_hint": "School lecture page" }'

# Upload a local image file
curl -sS -X POST http://127.0.0.1:8000/ocr_from_image_upload \
-F file=@/path/to/page.png

curl -sS http://127.0.0.1:8000/visuals_from_image_url \
-H "Content-Type: application/json" \
-d '{ "image_url": "https://example.com/page.png", "max_scenes": 5 }'

curl -sS -X POST http://127.0.0.1:8000/visuals_from_image_upload \
-F file=@/path/to/page.png \
-F max_scenes=5

curl -sS http://127.0.0.1:8000/tts/diag
curl -sS "http://127.0.0.1:8000/tts/duration?file=scene_1_alloy_xxx.mp3"
curl -sS "http://127.0.0.1:8000/debug/audio_info?file=scene_1_alloy_xxx.mp3"
curl -sS http://127.0.0.1:8000/debug/audios
curl -sS http://127.0.0.1:8000/debug/images
curl -sS http://127.0.0.1:8000/debug/storage

- Audio fails to load after revisiting a story
  - Make sure PUBLIC_BASE_URL points to your current public URL (the ngrok URL may change)
  - Store TTS files in a stable directory (TTS_OUTPUT_DIR); the backend serves it under /static/audio
- OpenAI Images error: invalid size
  - Use one of: 1024x1024, 1024x1536, 1536x1024, or auto (see OPENAI_IMAGE_SIZE)
- Replicate credit errors
  - 402 Insufficient credit → top up your Replicate account
- Mixed content blocked
  - Use HTTPS for both frontend and backend (the ngrok URL is HTTPS)
- CORS
  - Global CORS is enabled; if using credentials, set CORS_ORIGINS to your frontend origin
MIT License © 2025 Visurai Team
Made with care for learners who think in pictures.

