
Visurai — Visual Learning Copilot

Turn text into a narrated visual story: scenes, images, and audio — in seconds.

🏆 Built at the Good Vibes Only AI/ML Buildathon @ USC (2025)

🏆 1st place (Context Engineering), Software Engineering Award


Service Link

https://visurai-story-maker.lovable.app/

Overview

Project Demo: https://drive.google.com/file/d/16_YFVfVJoDPQqLkXXaRXSv_Dyr98bxey/view?usp=sharing


Visurai helps dyslexic and visual learners comprehend material by converting text into a sequence of AI-generated images with optional narration.

Paste any text and get:

  • A title and segmented scenes that preserve key facts and names
  • High-quality images per scene (Flux via Replicate or OpenAI gpt-image-1)
  • Per‑scene TTS audio and a single merged audio track with a timeline
  • Optional OCR to start from an image instead of text

Features

  • Context‑aware scene segmentation and detail‑preserving visual prompts (GPT‑4o)
  • Image generation providers:
    • Replicate: Flux 1.1 Pro (default), 16:9 targeting with aspect-ratio/size fallbacks
    • OpenAI: gpt‑image‑1 with supported sizes and automatic fallback
  • Narration:
    • Per‑scene TTS (OpenAI gpt‑4o‑mini‑tts)
    • Single merged MP3 with timestamps (ffmpeg concat demuxer)
  • Live progress via SSE (/generate_visuals_events)
  • OCR routes: generate from image URL or upload
  • Absolute asset URLs using PUBLIC_BASE_URL (e.g., ngrok) for frontend access

Architecture

Text / Image → OCR (optional)
				↓
Scene segmentation (GPT‑4o)
				↓
Detail‑preserving visual prompts
				↓
Image generation (Replicate Flux or OpenAI gpt‑image‑1)
				↓
TTS per scene → ffmpeg concat → single audio + timeline
				↓
Frontend (React) consumes JSON, images, audio, and SSE
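In code, the imperative pipeline mirrors this diagram step by step. A minimal sketch of that flow (function names here are illustrative placeholders, not the actual symbols in backend/main.py):

def build_story(text: str, max_scenes: int = 8) -> dict:
    # 1) Segment the source text into scene summaries (GPT-4o)
    scenes = segment_scenes(text, max_scenes=max_scenes)

    # 2) Expand each scene summary into a detail-preserving image prompt
    for scene in scenes:
        scene["prompt"] = build_visual_prompt(scene["scene_summary"])

    # 3) Generate one image per scene (Replicate Flux or OpenAI gpt-image-1)
    for scene in scenes:
        scene["image_url"] = generate_image(scene["prompt"])

    # 4) Narrate each scene, then merge the clips into one MP3 with a timeline
    clips = [synthesize_tts(scene["scene_summary"]) for scene in scenes]
    audio_url, duration_seconds, timeline = merge_audio(clips)

    return {
        "scenes": scenes,
        "audio_url": audio_url,
        "duration_seconds": duration_seconds,
        "timeline": timeline,
    }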

Repository Structure

good-vibes-only/
├── backend/
│   ├── main.py            # FastAPI app (SSE, OCR, TTS, visuals)
│   ├── image_gen.py       # Image provider adapters (Replicate/OpenAI)
│   ├── tts.py             # OpenAI TTS + ffmpeg merge
│   ├── settings.py        # Pydantic settings + .env loader
│   ├── pyproject.toml     # Backend deps (use uv/pip)
│   └── uv.lock
└── frontend/              # React app that calls the backend

Prerequisites

  • Python 3.10+ (tested up to 3.13)
  • ffmpeg installed (required for merged audio)
    • macOS: brew install ffmpeg
  • Provider keys as needed:
    • Replicate: REPLICATE_API_TOKEN
    • OpenAI: OPENAI_API_KEY

Backend — Quick Start (run from repo root)

From the repo root:

# 1) Install deps (using uv)
cd backend && uv sync && cd ..

# 2) Create backend/.env with your keys and config (see below)

# 3) Run the API from the repo root
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload

backend/.env (example)

# LLM
OPENAI_API_KEY=sk-...
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini

# Image provider (replicate | openai)
IMAGE_PROVIDER=replicate
REPLICATE_API_TOKEN=r8_...
REPLICATE_MODEL=black-forest-labs/flux-1.1-pro
REPLICATE_ASPECT_RATIO=16:9

# OpenAI Images (if IMAGE_PROVIDER=openai)
OPENAI_IMAGE_MODEL=gpt-image-1
OPENAI_IMAGE_SIZE=1536x1024   # allowed: 1024x1024, 1024x1536, 1536x1024, auto

# TTS
TTS_PROVIDER=openai
TTS_MODEL=gpt-4o-mini-tts
TTS_VOICE=alloy
TTS_OUTPUT_DIR=/tmp/seequence_audio

# Absolute URLs for frontend (ngrok/domain)
PUBLIC_BASE_URL=https://<your-ngrok-subdomain>.ngrok-free.dev

# CORS (optional – include your frontend origin when using credentials)
CORS_ORIGINS=https://<your-ngrok-subdomain>.ngrok-free.dev
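For reference, backend/settings.py reads these values with Pydantic settings. A minimal sketch of such a loader (the field list is trimmed and illustrative, not the exact backend code):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Load key=value pairs from backend/.env; real environment variables take precedence
    model_config = SettingsConfigDict(env_file="backend/.env", extra="ignore")

    openai_api_key: str = ""
    image_provider: str = "replicate"
    replicate_api_token: str = ""
    replicate_model: str = "black-forest-labs/flux-1.1-pro"
    public_base_url: str = ""

settings = Settings()  # import this object wherever configuration is needed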

Verify

# Health
curl -sS http://127.0.0.1:8000/health

# One image (provider-dependent)
curl -sS http://127.0.0.1:8000/generate_image \
	-H "Content-Type: application/json" \
	-d '{
		"prompt": "Clean educational infographic showing 1 AU ≈ 1.496e8 km. Label Earth and Sun. High contrast."
	}'

# Visuals + merged audio
curl -sS http://127.0.0.1:8000/generate_visuals_single_audio \
	-H "Content-Type: application/json" \
	-d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

Frontend — Quick Start (pnpm)

Point the frontend at the backend's public base URL (the same value you set as PUBLIC_BASE_URL).

Typical React workflow:

cd frontend
pnpm install
pnpm dev

Ensure your frontend uses absolute URLs from the backend responses (e.g., image_url, audio_url), which already include the PUBLIC_BASE_URL when set.

If your frontend needs an explicit base URL, set it (e.g., Vite):

# .env.local in frontend (example)
VITE_API_BASE=https://<your-ngrok-subdomain>.ngrok-free.dev

Engine Switch: LangGraph vs Imperative

The backend can run either:

  • Imperative flow (default): sequential segmentation → prompts → images
  • LangGraph flow: graph-based orchestration

Enable LangGraph by setting an env var and restarting the server:

export PIPELINE_ENGINE=langgraph
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload

Endpoints are the same (e.g., POST /generate_visuals), but execution uses the graph.
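For the curious, the graph flow wires the same stages as LangGraph nodes. A rough sketch of what that orchestration can look like (node names, state keys, and placeholder bodies are illustrative, not the actual backend graph):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class PipelineState(TypedDict, total=False):
    text: str
    scenes: list
    images: list

def segment(state: PipelineState) -> PipelineState:
    # Placeholder: call GPT-4o to split state["text"] into scene summaries
    return {"scenes": []}

def prompts(state: PipelineState) -> PipelineState:
    # Placeholder: build a detail-preserving visual prompt for each scene
    return {"scenes": state.get("scenes", [])}

def images(state: PipelineState) -> PipelineState:
    # Placeholder: call Replicate Flux or OpenAI gpt-image-1 per prompt
    return {"images": []}

graph = StateGraph(PipelineState)
graph.add_node("segment", segment)
graph.add_node("prompts", prompts)
graph.add_node("images", images)
graph.set_entry_point("segment")
graph.add_edge("segment", "prompts")
graph.add_edge("prompts", "images")
graph.add_edge("images", END)
app = graph.compile()  # app.invoke({"text": "..."}) runs the whole flow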


API Highlights

  • POST /generate_visuals → scenes with image URLs and a title
  • POST /generate_visuals_with_audio → scenes + per‑scene audio URLs + durations
  • POST /generate_visuals_single_audio → merged audio_url, total duration, timeline, scenes
  • GET /generate_visuals_events → Server‑Sent Events stream for progress
  • POST /visuals_from_image_url and /visuals_from_image_upload → OCR then visuals

API Reference

Below is a concise reference of the backend API: inputs, outputs, and example usage. All JSON responses use UTF‑8 and stable keys. When PUBLIC_BASE_URL is configured, media paths (e.g., /static/...) are returned as absolute URLs.

Health

  • GET /health
    • Response: { "status": "ok" }

Segment text into scenes

  • POST /segment
    • Body: { text: string, max_scenes?: number }
    • Response: { scenes: Array<{ scene_id: number, scene_summary: string, source_sentence_indices?: number[], source_sentences?: string[] }> }
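The curl examples further down do not cover /segment, so here is a minimal Python call (assumes the requests package):

import requests

r = requests.post(
    "http://127.0.0.1:8000/segment",
    json={"text": "The Sun is a G-type star...", "max_scenes": 5},
    timeout=120,
)
for scene in r.json()["scenes"]:
    print(scene["scene_id"], scene["scene_summary"])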

Generate one image (test)

  • POST /generate_image
    • Body: { prompt: string, seed?: number }
    • Response: { image_url: string } (absolute if PUBLIC_BASE_URL set)

Full pipeline: text → images

  • POST /generate_visuals
    • Body: { text: string, max_scenes?: number }
    • Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }> }

Full pipeline + per‑scene TTS

  • POST /generate_visuals_with_audio
    • Body: { text: string, max_scenes?: number }
    • Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences?, audio_url, audio_duration_seconds }> }
    • Notes: audio_url points to /static/audio/...; duration is seconds (float).

Full pipeline + single merged audio

  • POST /generate_visuals_single_audio
    • Body: { text: string, max_scenes?: number }
    • Response: { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }>, audio_url: string, duration_seconds: number, timeline: Array<{ scene_id: number, start_sec: number, duration_sec: number }> }
    • Notes: merged MP3 generated with ffmpeg concat demuxer.
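The merge step writes an ffmpeg concat list and shells out to ffmpeg. A minimal sketch of that approach (the function name and temp-file handling are illustrative, not the exact tts.py code):

import pathlib
import subprocess
import tempfile

def merge_mp3s(clip_paths: list[str], out_path: str) -> str:
    # Write a concat-demuxer list file: one "file '<path>'" line per clip, in order
    list_file = pathlib.Path(tempfile.mkdtemp()) / "clips.txt"
    list_file.write_text(
        "\n".join(f"file '{pathlib.Path(p).resolve()}'" for p in clip_paths)
    )

    # -f concat -safe 0 reads the list; -c copy concatenates without re-encoding
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", str(list_file), "-c", "copy", out_path],
        check=True,
    )
    return out_path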

Progress (SSE) — images only

  • GET /generate_visuals_events?text=...&max_scenes=8
    • Event stream content‑type: text/event-stream
    • Events (event name → data JSON):
      • started → { message: "begin" }
      • segmented → { count: number }
      • summarized → { has_summary: boolean }
      • prompt → { scene_id, prompt }
      • image:started → { scene_id }
      • image:done → { scene_id, image_url }
      • complete → { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }> }
    • Tip: With ngrok, append ?ngrok-skip-browser-warning=true to the URL to avoid the interstitial for EventSource.
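Outside the browser, the stream is easy to read line by line. A small sketch using the requests package (SSE parsing here is simplified to event:/data: lines):

import json
import requests

url = "http://127.0.0.1:8000/generate_visuals_events"
params = {"text": "The Sun is a G-type star...", "max_scenes": 5}

with requests.get(url, params=params, stream=True, timeout=600) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
            print(event, data)
            if event == "complete":
                break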

Progress (SSE) — images + per‑scene TTS + merged audio

  • GET /generate_visuals_single_audio_events?text=...&max_scenes=8
    • Events:
      • started, segmented, summarized, prompt, image:started, image:done
      • tts:started → { scene_id }
      • tts:done → { scene_id, audio_url, duration_sec }
      • tts:merge_started → { count }
      • tts:merge_done → { audio_url, duration_seconds, timeline }
      • complete → { title?: string, scenes: Array<{ scene_id, scene_summary, prompt, image_url, source_sentence_indices?, source_sentences? }>, audio_url, duration_seconds, timeline }

OCR only (image → text)

  • POST /ocr_from_image_url
    • Body: { image_url: string, prompt_hint?: string }
    • Response: { extracted_text: string }
  • POST /ocr_from_image_upload (multipart)
    • Form: file=@path/to/img, content‑type multipart/form-data
    • Response: { extracted_text: string }

One‑shot: image → OCR → visuals

  • POST /visuals_from_image_url
    • Body: { image_url: string, max_scenes?: number, prompt_hint?: string }
    • Response: { scenes: [...] } (same shape as /generate_visuals)
  • POST /visuals_from_image_upload (multipart)
    • Form: file=@path, max_scenes?
    • Response: { extracted_text: string, result: { title?: string, scenes: [...] } }

Debug utilities (optional)

  • GET /tts/diag → { mutagen: boolean, tinytag: boolean, tts_output_dir: string }
  • GET /tts/duration?file=<name> → { file, exists, duration, size }
  • GET /debug/audio_info?file=<name> → { file, exists, size, mtime, path, public_url, public_base_url }
  • GET /debug/audios → { count, items: Array<{ file, size, mtime, url }> } (latest first)
  • GET /debug/images → { count, items: Array<{ file, size, mtime, url }> } (latest first)
  • GET /debug/storage → summary of media dirs, counts, and public mounts

Examples

The snippets below assume the API is running at http://127.0.0.1:8000. If you exposed it via ngrok, replace with your PUBLIC_BASE_URL.

1) One image (curl)

curl -sS http://127.0.0.1:8000/generate_image \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Clean educational infographic showing 1 AU ≈ 1.496e8 km. Label Earth and Sun. High contrast.",
    "seed": 42
  }'

2) Full pipeline (POST)

curl -sS http://127.0.0.1:8000/generate_visuals \
  -H "Content-Type: application/json" \
  -d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

3) Full pipeline + per‑scene TTS (POST)

curl -sS http://127.0.0.1:8000/generate_visuals_with_audio \
  -H "Content-Type: application/json" \
  -d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

4) Full pipeline + single merged audio (POST)

curl -sS http://127.0.0.1:8000/generate_visuals_single_audio \
  -H "Content-Type: application/json" \
  -d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'

5) SSE: images only (curl)

# -N disables buffering to stream events as they arrive
curl -N "http://127.0.0.1:8000/generate_visuals_events?text=The%20Sun%20is%20a%20G-type%20star...&max_scenes=5"

6) SSE: images + per‑scene TTS + merged audio (curl)

curl -N "http://127.0.0.1:8000/generate_visuals_single_audio_events?text=The%20Sun%20is%20a%20G-type%20star...&max_scenes=5"

7) OCR (curl)

# From a public image URL
curl -sS http://127.0.0.1:8000/ocr_from_image_url \
  -H "Content-Type: application/json" \
  -d '{ "image_url": "https://example.com/page.png", "prompt_hint": "School lecture page" }'

# Upload a local image file
curl -sS -X POST http://127.0.0.1:8000/ocr_from_image_upload \
  -F file=@/path/to/page.png

8) One‑shot: image → OCR → visuals (curl)

curl -sS http://127.0.0.1:8000/visuals_from_image_url \
  -H "Content-Type: application/json" \
  -d '{ "image_url": "https://example.com/page.png", "max_scenes": 5 }'

curl -sS -X POST http://127.0.0.1:8000/visuals_from_image_upload \
  -F file=@/path/to/page.png \
  -F max_scenes=5

9) Debug helpers (curl)

curl -sS http://127.0.0.1:8000/tts/diag
curl -sS "http://127.0.0.1:8000/tts/duration?file=scene_1_alloy_xxx.mp3"
curl -sS "http://127.0.0.1:8000/debug/audio_info?file=scene_1_alloy_xxx.mp3"
curl -sS http://127.0.0.1:8000/debug/audios
curl -sS http://127.0.0.1:8000/debug/images
curl -sS http://127.0.0.1:8000/debug/storage

Troubleshooting

  • Audio fails to load after revisiting a story
    • Make sure PUBLIC_BASE_URL points to your current public URL (ngrok URL may change)
    • Store TTS files in a stable directory (TTS_OUTPUT_DIR); the backend serves it under /static/audio
  • OpenAI Images error: invalid size
    • Use one of: 1024x1024, 1024x1536, 1536x1024, or auto (see OPENAI_IMAGE_SIZE)
  • Replicate credit errors
    • 402 Insufficient credit → top up your Replicate account
  • Mixed content blocked
    • Use HTTPS for both frontend and backend (ngrok URL is HTTPS)
  • CORS
    • Global CORS is enabled; if using credentials, set CORS_ORIGINS to your frontend origin

License

MIT License © 2025 Visurai Team


Made with care for learners who think in pictures.
