UN Web TV Transcribed

Browse and search UN Web TV videos with automatically generated transcripts, speaker identification, and topic analysis.

Live site: transcripts.un.org

Overview

This app scrapes UN Web TV (which has no public API), stores video metadata in PostgreSQL, and provides AI-powered transcription with speaker diarization, speaker identification, and topic analysis. Videos appear in a filterable table with real-time status tracking, the full archive is searchable, and each video has its own page with an embedded Kaltura player.

Features

  • Video schedule table with column filters, sorting, pagination, and global search (TanStack Table)
  • Full-archive search via PostgreSQL (beyond the rolling schedule window)
  • Embedded video pages with Kaltura player
  • AI transcription with per-language speech-to-text routing (AssemblyAI / Azure / Alibaba / Gemini), diarization, and paragraph breaks
  • Speaker identification via Azure OpenAI (maps speaker labels to named delegates)
  • Scheduled transcription for upcoming events (cron job picks them up when audio becomes available)
  • JSON API for programmatic access to video data
  • Status badges (Live / Scheduled / Finished) with smart sorting
  • Metadata extraction from titles (UN body, event code, session number, etc.)
  • API cost tracking per transcript (STT provider usage, OpenAI tokens)
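
The "smart sorting" of status badges is not specified in this README. As a minimal sketch, assuming it means live events first, then scheduled events by soonest start, then finished events by most recent start, a comparator could look like the following. The `VideoRow` shape, the `STATUS_RANK` table, and the ordering rules are all illustrative assumptions, not the app's actual implementation.

```typescript
// Hypothetical row shape; the real table rows carry more fields.
type Status = "live" | "scheduled" | "finished";

interface VideoRow {
  title: string;
  status: Status;
  start: Date;
}

// Assumed priority: live first, then scheduled, then finished.
const STATUS_RANK: Record<Status, number> = { live: 0, scheduled: 1, finished: 2 };

function compareVideos(a: VideoRow, b: VideoRow): number {
  const byStatus = STATUS_RANK[a.status] - STATUS_RANK[b.status];
  if (byStatus !== 0) return byStatus;
  const delta = a.start.getTime() - b.start.getTime();
  // Scheduled: soonest first. Live and finished: most recent first.
  return a.status === "scheduled" ? delta : -delta;
}

const rows: VideoRow[] = [
  { title: "Past A", status: "finished", start: new Date("2025-01-01T10:00:00Z") },
  { title: "Upcoming B", status: "scheduled", start: new Date("2025-01-05T10:00:00Z") },
  { title: "Now", status: "live", start: new Date("2025-01-03T10:00:00Z") },
  { title: "Upcoming A", status: "scheduled", start: new Date("2025-01-04T10:00:00Z") },
  { title: "Past B", status: "finished", start: new Date("2025-01-02T10:00:00Z") },
];

// Copy before sorting so the input array is left untouched.
const sortedTitles = [...rows].sort(compareVideos).map((row) => row.title);
// sortedTitles is ["Now", "Upcoming A", "Upcoming B", "Past B", "Past A"]
```

A comparator of this shape could be passed to a table library as a custom sort function, which keeps the ordering rule in one place.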

Documentation

Detailed documentation lives in docs/ (the individual files are listed under Project Structure below).

Getting Started

pnpm install
cp .env.example .env.local   # fill in values
pnpm dev                     # http://localhost:3000

Commands

pnpm dev                      # Next.js dev server with Turbopack
pnpm build                    # Production build
pnpm lint                     # ESLint
pnpm typecheck                # TypeScript type-check (no emit)
pnpm format                   # Prettier

# Data management
pnpm sync-videos              # Sync video metadata from UN Web TV into PostgreSQL
pnpm fetch-video-metadata     # Dump stored video records to analysis/video-metadata.json
pnpm retranscribe             # Re-run transcription pipeline on stored transcripts
pnpm reidentify               # Re-run speaker identification on stored transcripts
pnpm usage-report             # Print API usage/cost report
pnpm usage-benchmark          # Compare/benchmark API pricing config

# Eval system (see eval/README.md)
pnpm eval -- --symbol=S/PV.9826 --providers=assemblyai --languages=en

Environment Variables

See .env.example for all variables. Core ones:

| Variable              | Required   | Purpose                                 |
|-----------------------|------------|-----------------------------------------|
| DATABASE_URL          | Yes        | PostgreSQL connection string            |
| GEMINI_API_KEY        | Yes        | Floor transcription + PV alignment      |
| ASSEMBLYAI_API_KEY    | Yes        | English transcription                   |
| DASHSCOPE_API_KEY     | Yes        | Chinese transcription (Fun-ASR)         |
| AZURE_OPENAI_API_KEY  | Yes        | fr/es/ar/ru transcription + speaker ID  |
| AZURE_OPENAI_ENDPOINT | Yes        | As above                                |
| CRON_SECRET           | Production | Vercel cron job auth                    |
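
The variables above imply a per-language routing of audio to speech-to-text providers. The sketch below restates that mapping as a lookup table and checks which keys a given language would need. It is an illustration only: the language codes (`zh`, `floor`, and so on), the `Route` type, and the helper names are assumptions, and the real routing lives in lib/providers/config.ts.

```typescript
type SttProvider = "assemblyai" | "azure" | "alibaba" | "gemini";

interface Route {
  provider: SttProvider;
  envVars: string[]; // variables that must be set for this route
}

const AZURE: Route = {
  provider: "azure",
  envVars: ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
};

// Assumed language keys; "floor" stands for the original floor audio channel.
const ROUTES = new Map<string, Route>([
  ["en", { provider: "assemblyai", envVars: ["ASSEMBLYAI_API_KEY"] }],
  ["zh", { provider: "alibaba", envVars: ["DASHSCOPE_API_KEY"] }],
  ["fr", AZURE],
  ["es", AZURE],
  ["ar", AZURE],
  ["ru", AZURE],
  ["floor", { provider: "gemini", envVars: ["GEMINI_API_KEY"] }],
]);

// Returns undefined for languages without a configured route.
function routeFor(language: string): Route | undefined {
  return ROUTES.get(language.toLowerCase());
}

// Lists the variables a language needs that are absent or empty in `env`.
function missingEnvVars(
  language: string,
  env: Record<string, string | undefined>,
): string[] {
  const route = routeFor(language);
  if (!route) return [];
  return route.envVars.filter((name) => !env[name]);
}
```

In a script, `missingEnvVars("fr", process.env)` would report which Azure variables still need values before French audio can be transcribed.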

Tech Stack

  • Framework: Next.js 16 (App Router, Server Components, Turbopack)
  • Language: TypeScript 6
  • Styling: Tailwind CSS v4
  • UI: shadcn/ui, Lucide icons, Radix UI primitives
  • Table: TanStack Table v8
  • Database: PostgreSQL via pg connection pool
  • Transcription: per-language STT routing (AssemblyAI Universal-3 Pro, Azure gpt-4o-transcribe, Alibaba Fun-ASR, Gemini 3 Flash) — see lib/providers/config.ts
  • Speaker ID: Azure OpenAI (structured output via Zod)
  • Video hosting: Kaltura (partner ID: 2503451)
  • Deployment: Vercel — three cron jobs: process-scheduled every 5 min, sync-videos every 15 min, check-pv every 6 hours
  • Package manager: pnpm
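
When CRON_SECRET is set, Vercel sends it to cron endpoints as an `Authorization: Bearer <secret>` header. The following sketch shows how a cron route might check that header, together with cron expressions matching the three intervals listed above. The function name, the `CRON_SCHEDULES` constant, and the exact expressions are assumptions for illustration; the project's own vercel.json and route handlers are authoritative.

```typescript
// Cron expressions matching the stated intervals (assumed, not copied from the repo).
const CRON_SCHEDULES: Record<string, string> = {
  "/api/cron/process-scheduled": "*/5 * * * *", // every 5 minutes
  "/api/cron/sync-videos": "*/15 * * * *", // every 15 minutes
  "/api/cron/check-pv": "0 */6 * * *", // every 6 hours, on the hour
};

// Returns true only when a secret is configured and the header carries it.
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string | undefined,
): boolean {
  if (!cronSecret) return false; // refuse everything if no secret is configured
  return authorizationHeader === `Bearer ${cronSecret}`;
}
```

Inside a route handler this would typically be called as `isAuthorizedCronRequest(request.headers.get("authorization"), process.env.CRON_SECRET)`, returning a 401 response when it yields false.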

Project Structure

app/
  page.tsx                          # Home page (server component, fetches schedule)
  [...meeting]/page.tsx             # Video page with player + transcript
  about/page.tsx                    # About page
  methodology/page.tsx              # Methodology page
  layout.tsx                        # Root layout (Roboto font, corner logo)
  globals.css                       # Tailwind v4 theme + UN color palette
  api/
    health/route.ts                 # DB health probe
    languages/route.ts              # Available audio languages for a Kaltura entry
    transcripts/route.ts            # Start or schedule transcription
    transcripts/check/route.ts      # Cache lookup for an existing transcript
    transcripts/[id]/route.ts       # Poll transcript status / fetch result
    transcripts/[id]/analysis/...   # Run proposition analysis
    search/route.ts                 # Full-archive video search
    pv/route.ts                     # Fetch + cache PV document JSON
    pv/align/route.ts               # Align PV document with audio (timestamps)
    cron/sync-videos/route.ts       # Cron: sync video metadata
    cron/process-scheduled/route.ts # Cron: process scheduled transcriptions
    cron/check-pv/route.ts          # Cron: check PV document availability
  json/
    route.ts                        # JSON API: video list
    [...meeting]/route.ts           # JSON API: single video

components/                         # Mixed PascalCase / kebab-case naming — match neighbours
  TranscriptTable.tsx               # Main schedule table (client, TanStack Table)
  SiteHeader.tsx                    # Header (home vs nav variants)
  NavMenu.tsx, TimezonePicker.tsx, AnimatedCornerLogo.tsx
  video-page-client.tsx             # Video page client wrapper
  transcription-panel.tsx           # Transcribe/poll/display flow
  transcript-view.tsx, transcript-toolbar.tsx, raw-transcript-view.tsx
  speaker-toc.tsx                   # Speaker table of contents
  pv-panel.tsx                      # Official verbatim record panel
  analysis-view.tsx                 # Propositions / stakeholder positions
  stage-progress.tsx                # Pipeline progress indicator
  video-player.tsx                  # Kaltura embedded player
  ui/                               # shadcn primitives

lib/
  db.ts                             # Database layer (all queries, pg pool, webtv.-qualified tables)
  cached-db.ts                      # next/cache wrappers for read-heavy queries
  un-api.ts                         # UN Web TV HTML scraper + metadata extraction
  transcription.ts                  # Transcription submission + audio URL resolution
  gemini-transcription.ts           # Gemini Files API transcription (chunked)
  pipeline/                         # Analysis pipeline stages (speaker ID, resegment, topics, propositions)
  speakers.ts                       # Speaker mapping CRUD
  usage-tracking.ts                 # API cost tracking (Gemini + OpenAI)
  pv-alignment.ts, pv-documents.ts, pv-parser.ts  # PV document pipeline
  kaltura.ts, kaltura-helpers.ts    # Kaltura entry ID resolution + audio URL
  meeting-slug.ts                   # Bidirectional slug ↔ document symbol conversion
  config.ts                         # App config (lookback days, Gemini pricing card)
  api-error.ts                      # Standardized API error responses
  languages.ts, country-lookup.ts, timezone.ts
  providers/                        # STT provider implementations (shared with eval/)
    registry.ts, config.ts, models.ts, types.ts, convert.ts
    gemini-production.ts, gemini.ts, assemblyai.ts, ...
  hooks/
    use-playback-tracking.ts, use-timezone.tsx
  load-env.ts                       # Loads .env.local for scripts outside Next.js

scripts/                            # CLI scripts (run via tsx, use lib/load-env)
  sync-videos.ts                    # Scrape UN Web TV → database
  fetch-video-metadata.ts           # Dump video records to analysis/
  retranscribe.ts                   # Re-run transcription on existing records
  reidentify.ts                     # Re-run speaker identification
  test-pv-parser.ts, test-pv-alignment.ts, compare-transcription.ts

sql/
  schema.sql                        # webtv schema, tables, indexes
  role.sql                          # webtv_app application role

docs/
  ai.md                             # AI pipeline: models, stages, design decisions
  webtv-kaltura.md                  # UN Web TV scraping & Kaltura three-ID system
  eval.md                           # Eval system: providers, metrics, corpus, dashboard
  official-transcripts.md           # PV vs SR records by UN organ
  api.md                            # Public JSON API + URL scheme
  TODO.md                           # Backlog notes

eval/                               # Independent eval harness (see docs/eval.md)
  dashboard/                        # Standalone Vite + React dashboard (npm, not pnpm)

REVIEW.md                           # Latest comprehensive code review (root)
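
Under Next.js App Router conventions, the json/ routes in the tree above are served at /json (video list) and /json/<segments> (single video). As a rough sketch of building those URLs for programmatic access, assuming the meeting path segments are already known, a helper might look like this. The function name and the example segments are hypothetical; docs/api.md describes the actual URL scheme and response format.

```typescript
// Builds a JSON API URL from a base URL and optional meeting path segments.
function jsonApiUrl(baseUrl: string, meetingSegments: string[] = []): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const path = ["json", ...meetingSegments.map(encodeURIComponent)].join("/");
  return `${base}/${path}`;
}

const listUrl = jsonApiUrl("https://transcripts.un.org");
// listUrl is "https://transcripts.un.org/json"

// Hypothetical segments; the real slug format is defined in lib/meeting-slug.ts.
const videoUrl = jsonApiUrl("https://transcripts.un.org/", ["s", "pv.9826"]);
// videoUrl is "https://transcripts.un.org/json/s/pv.9826"
```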

Eval System

The eval/ directory is an independent benchmarking harness for transcription providers. It has its own tsconfig and is excluded from the root type-check; its dashboard uses npm (not pnpm). See docs/eval.md for full details and eval/README.md for running instructions.