united-nations/transcripts

UN Web TV Transcribed

Browse and search UN Web TV videos with AI-generated transcripts, speaker identification, and topic analysis.

Live site: transcripts-un-two-zero.org

Overview

This app scrapes UN Web TV (which has no public API), stores video metadata in PostgreSQL, and provides AI-powered transcription with speaker diarization, speaker identification, and topic analysis. Videos appear in a filterable table with real-time status tracking, and the full archive is searchable. Each video has its own page with an embedded Kaltura player.

Features

  • Video schedule table with column filters, sorting, pagination, and global search (TanStack Table)
  • Full-archive search via PostgreSQL (beyond the rolling schedule window)
  • Embedded video pages with Kaltura player
  • AI transcription with per-language speech-to-text routing (AssemblyAI / Azure / Alibaba / Gemini), diarization, and paragraph breaks
  • Speaker identification via Azure OpenAI (maps speaker labels to named delegates)
  • Scheduled transcription for upcoming events (cron job picks them up when audio becomes available)
  • JSON API for programmatic access to video data
  • Status badges (Live / Scheduled / Finished) with smart sorting
  • Metadata extraction from titles (UN body, event code, session number, etc.)
  • API cost tracking per transcript (STT provider usage, OpenAI tokens)
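
The per-language speech-to-text routing listed above could be sketched roughly as follows. This is only an illustration: the language-to-provider pairs follow this README's environment variable table, while the names `SttProvider`, `STT_ROUTES`, and `pickProvider` and the `zh` and `floor` keys are assumptions, not the project's actual code (the real routing lives in lib/providers/config.ts).

```typescript
// Hypothetical sketch of per-language STT routing; all names are illustrative.
type SttProvider = "assemblyai" | "azure" | "alibaba" | "gemini";

// Pairs taken from the README: English -> AssemblyAI, Chinese -> Alibaba Fun-ASR,
// fr/es/ar/ru -> Azure, floor audio -> Gemini. The keys are assumed codes.
const STT_ROUTES = new Map<string, SttProvider>([
  ["en", "assemblyai"],
  ["zh", "alibaba"],
  ["fr", "azure"],
  ["es", "azure"],
  ["ar", "azure"],
  ["ru", "azure"],
  ["floor", "gemini"],
]);

// Look up the provider for a language code, failing loudly on unknown codes
// instead of silently falling back to a default provider.
function pickProvider(language: string): SttProvider {
  const provider = STT_ROUTES.get(language.toLowerCase());
  if (provider === undefined) {
    throw new Error(`No STT provider configured for language "${language}"`);
  }
  return provider;
}
```

Throwing on an unknown language is one possible design choice; a real implementation might instead fall back to a default provider.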

Documentation

Detailed documentation lives in docs/ (the individual files are listed under Project Structure below).

Getting Started

pnpm install
cp .env.example .env.local   # fill in values
pnpm dev                     # http://localhost:3000

Commands

pnpm dev                      # Next.js dev server with Turbopack
pnpm build                    # Production build
pnpm lint                     # ESLint
pnpm typecheck                # TypeScript type-check (no emit)
pnpm format                   # Prettier

# Data management
pnpm sync-videos              # Sync video metadata from UN Web TV into PostgreSQL
pnpm fetch-video-metadata     # Dump stored video records to analysis/video-metadata.json
pnpm retranscribe             # Re-run transcription pipeline on stored transcripts
pnpm reidentify               # Re-run speaker identification on stored transcripts
pnpm usage-report             # Print API usage/cost report
pnpm usage-benchmark          # Compare/benchmark API pricing config

# Eval system (see eval/README.md)
pnpm eval -- --symbol=S/PV.9826 --providers=assemblyai --languages=en

Environment Variables

See .env.example for all variables. Core ones:

Variable                Required     Purpose
DATABASE_URL            Yes          PostgreSQL connection string
GEMINI_API_KEY          Yes          Floor transcription + PV alignment
ASSEMBLYAI_API_KEY      Yes          English transcription
DASHSCOPE_API_KEY       Yes          Chinese transcription (Fun-ASR)
AZURE_OPENAI_API_KEY    Yes          fr/es/ar/ru transcription + speaker ID
AZURE_OPENAI_ENDPOINT   Yes          as above
CRON_SECRET             Production   Vercel cron job auth
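
A startup check for these variables might look like the sketch below. The variable names come from the table above; the helper `missingEnvVars` and the decision to treat blank values as missing are assumptions made for illustration, and the project may validate its environment differently or not at all.

```typescript
// Variables marked "Yes" in the table above.
const ALWAYS_REQUIRED = [
  "DATABASE_URL",
  "GEMINI_API_KEY",
  "ASSEMBLYAI_API_KEY",
  "DASHSCOPE_API_KEY",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_ENDPOINT",
] as const;

// Variables marked "Production" in the table above.
const PRODUCTION_ONLY = ["CRON_SECRET"] as const;

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the names of required variables that are
// unset or blank. An empty result means the environment looks complete.
function missingEnvVars(env: Env, isProduction: boolean): string[] {
  const required: readonly string[] = isProduction
    ? [...ALWAYS_REQUIRED, ...PRODUCTION_ONLY]
    : ALWAYS_REQUIRED;
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node.js script this could be called as `missingEnvVars(process.env, process.env.NODE_ENV === "production")`, logging or throwing when the returned list is not empty.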

Tech Stack

  • Framework: Next.js 16 (App Router, Server Components, Turbopack)
  • Language: TypeScript 6
  • Styling: Tailwind CSS v4
  • UI: shadcn/ui, Lucide icons, Radix UI primitives
  • Table: TanStack Table v8
  • Database: PostgreSQL via pg connection pool
  • Transcription: per-language STT routing (AssemblyAI Universal-3 Pro, Azure gpt-4o-transcribe, Alibaba Fun-ASR, Gemini 3 Flash) — see lib/providers/config.ts
  • Speaker ID: Azure OpenAI (structured output via Zod)
  • Video hosting: Kaltura (partner ID: 2503451)
  • Deployment: Vercel — three cron jobs: process-scheduled every 5 min, sync-videos every 15 min, check-pv every 6 hours
  • Package manager: pnpm
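
As a rough illustration of the cron setup, the sketch below pairs each cron route with a cron expression and shows one way to check the `CRON_SECRET`. The paths follow from the app/api/cron/ routes in the project structure. The cron expressions are a direct translation of "every 5 min", "every 15 min", and "every 6 hours" and may not match the deployed configuration exactly. Vercel sends the secret as an `Authorization: Bearer <CRON_SECRET>` header on cron invocations. The names `CRON_SCHEDULES` and `isAuthorizedCronRequest` are hypothetical.

```typescript
// Assumed schedules, written as standard five-field cron expressions.
const CRON_SCHEDULES: Record<string, string> = {
  "/api/cron/process-scheduled": "*/5 * * * *", // every 5 minutes
  "/api/cron/sync-videos": "*/15 * * * *", // every 15 minutes
  "/api/cron/check-pv": "0 */6 * * *", // every 6 hours, on the hour
};

// Hypothetical guard for a cron route: true only when the Authorization
// header carries the configured secret as a bearer token.
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string | undefined,
): boolean {
  // Reject everything when no secret is configured, so a missing variable
  // can never be matched by a literal "Bearer undefined" header.
  if (!cronSecret) return false;
  return authorizationHeader === `Bearer ${cronSecret}`;
}
```

A route handler would typically pass `request.headers.get("authorization")` and `process.env.CRON_SECRET` to the guard and respond with 401 when it returns false. A stricter version could compare the values in constant time.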

Project Structure

app/
  page.tsx                          # Home page (server component, fetches schedule)
  [...meeting]/page.tsx             # Video page with player + transcript
  about/page.tsx                    # About page
  methodology/page.tsx              # Methodology page
  layout.tsx                        # Root layout (Roboto font, corner logo)
  globals.css                       # Tailwind v4 theme + UN color palette
  api/
    health/route.ts                 # DB health probe
    languages/route.ts              # Available audio languages for a Kaltura entry
    transcripts/route.ts            # Start or schedule transcription
    transcripts/check/route.ts      # Cache lookup for an existing transcript
    transcripts/[id]/route.ts       # Poll transcript status / fetch result
    transcripts/[id]/analysis/...   # Run proposition analysis
    search/route.ts                 # Full-archive video search
    pv/route.ts                     # Fetch + cache PV document JSON
    pv/align/route.ts               # Align PV document with audio (timestamps)
    cron/sync-videos/route.ts       # Cron: sync video metadata
    cron/process-scheduled/route.ts # Cron: process scheduled transcriptions
    cron/check-pv/route.ts          # Cron: check PV document availability
  json/
    route.ts                        # JSON API: video list
    [...meeting]/route.ts           # JSON API: single video

components/                         # Mixed PascalCase / kebab-case naming — match neighbours
  TranscriptTable.tsx               # Main schedule table (client, TanStack Table)
  SiteHeader.tsx                    # Header (home vs nav variants)
  NavMenu.tsx, TimezonePicker.tsx, AnimatedCornerLogo.tsx
  video-page-client.tsx             # Video page client wrapper
  transcription-panel.tsx           # Transcribe/poll/display flow
  transcript-view.tsx, transcript-toolbar.tsx, raw-transcript-view.tsx
  speaker-toc.tsx                   # Speaker table of contents
  pv-panel.tsx                      # Official verbatim record panel
  analysis-view.tsx                 # Propositions / stakeholder positions
  stage-progress.tsx                # Pipeline progress indicator
  video-player.tsx                  # Kaltura embedded player
  ui/                               # shadcn primitives

lib/
  db.ts                             # Database layer (all queries, pg pool, webtv.-qualified tables)
  cached-db.ts                      # next/cache wrappers for read-heavy queries
  un-api.ts                         # UN Web TV HTML scraper + metadata extraction
  transcription.ts                  # Transcription submission + audio URL resolution
  gemini-transcription.ts           # Gemini Files API transcription (chunked)
  pipeline/                         # Analysis pipeline stages (speaker ID, resegment, topics, propositions)
  speakers.ts                       # Speaker mapping CRUD
  usage-tracking.ts                 # API cost tracking (Gemini + OpenAI)
  pv-alignment.ts, pv-documents.ts, pv-parser.ts  # PV document pipeline
  kaltura.ts, kaltura-helpers.ts    # Kaltura entry ID resolution + audio URL
  meeting-slug.ts                   # Bidirectional slug ↔ document symbol conversion
  config.ts                         # App config (lookback days, Gemini pricing card)
  api-error.ts                      # Standardized API error responses
  languages.ts, country-lookup.ts, timezone.ts
  providers/                        # STT provider implementations (shared with eval/)
    registry.ts, config.ts, models.ts, types.ts, convert.ts
    gemini-production.ts, gemini.ts, assemblyai.ts, ...
  hooks/
    use-playback-tracking.ts, use-timezone.tsx
  load-env.ts                       # Loads .env.local for scripts outside Next.js

scripts/                            # CLI scripts (run via tsx, use lib/load-env)
  sync-videos.ts                    # Scrape UN Web TV → database
  fetch-video-metadata.ts           # Dump video records to analysis/
  retranscribe.ts                   # Re-run transcription on existing records
  reidentify.ts                     # Re-run speaker identification
  test-pv-parser.ts, test-pv-alignment.ts, compare-transcription.ts

sql/
  schema.sql                        # webtv schema, tables, indexes
  role.sql                          # webtv_app application role

docs/
  ai.md                             # AI pipeline: models, stages, design decisions
  webtv-kaltura.md                  # UN Web TV scraping & Kaltura three-ID system
  eval.md                           # Eval system: providers, metrics, corpus, dashboard
  official-transcripts.md           # PV vs SR records by UN organ
  api.md                            # Public JSON API + URL scheme
  TODO.md                           # Backlog notes

eval/                               # Independent eval harness (see docs/eval.md)
  dashboard/                        # Standalone Vite + React dashboard (npm, not pnpm)

REVIEW.md                           # Latest comprehensive code review (root)

Eval System

The eval/ directory is an independent benchmarking harness for transcription providers. It has its own tsconfig, is excluded from the root type-check, and the dashboard uses npm (not pnpm). See docs/eval.md for full details and eval/README.md for running instructions.
