# PDF Calendar Extractor

pwa · gemini ai · pdf.js · ics · csv · json · markdown · zero cost · privacy first
Beta Notice: ScrapeGoat is under active development and has not been fully tested in production. Expect rough edges. Bug reports and feedback are welcome via GitHub Issues.
You have a PDF schedule. You need calendar events. Existing tools don't work — they choke on weird date formats, multi-column layouts, and venue-specific quirks. ScrapeGoat fixes that.
Drop a PDF, answer a few multiple-choice questions, and get your events as ICS, CSV, JSON, or Markdown. Everything runs in your browser. Your files never leave your device.
- Drop a PDF — drag and drop any schedule into the browser
- Pick a template — use a saved template, browse community templates, or create a new one
- AI wizard — a guided multiple-choice interview builds a parsing template in about 2 minutes (no technical knowledge needed)
- Review events — check parsed results, flag issues, accept AI-suggested corrections
- Export — download as ICS, CSV, JSON, or Markdown with per-format options
## Client-Side PDF Extraction
- Powered by Mozilla's PDF.js — runs entirely in your browser
- Multi-page extraction with page break markers
- Image-only PDF detection with clear error messages
- Multi-column layout detection with warnings
- Drag-and-drop or file picker, max 50 MB
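The multi-column detection mentioned above can be sketched as a simple heuristic over PDF.js-style text items. This is an illustrative version, not ScrapeGoat's actual detector: the `TextItem` shape mirrors only the fields the heuristic needs, and the thresholds are made-up tuning values.

```typescript
// Hypothetical sketch: flag a likely multi-column layout by counting
// how many text items start in the left vs. right half of the page.
// Threshold values (0.45, 1.05, 0.3) are illustrative assumptions.
interface TextItem {
  str: string;
  x: number; // left edge of the item, in PDF units
}

function looksMultiColumn(items: TextItem[], pageWidth: number): boolean {
  if (items.length < 10) return false; // too little text to judge
  const mid = pageWidth / 2;
  let left = 0;
  let right = 0;
  for (const it of items) {
    if (it.x < mid * 0.45) left++;       // clearly in the left column
    else if (it.x > mid * 1.05) right++; // clearly in the right column
  }
  // If both halves hold a substantial share of line starts, extracted
  // text may interleave columns, so the app should warn the user.
  const total = items.length;
  return left / total > 0.3 && right / total > 0.3;
}
```

When the heuristic fires, the safest UX is a warning rather than a hard error, since some single-column PDFs have wide indentation that looks column-like.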
## AI-Powered Template Wizard
- Guided 6-step interview: document structure, date format, timezone, locations, status codes, event names
- Gemini 2.0 Flash powers the analysis via Cloudflare Worker proxy
- Turnstile bot protection — no API keys needed from users
- Correction flow: flag events, get AI alternatives, iterate up to 3 rounds
- AI-suggested template names
- Progressive timeout UX: 30s warning, 45s cancel option, 60s auto-timeout
- Graceful degradation: rate-limited or offline users can still use saved/community templates
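The progressive timeout UX above boils down to mapping elapsed time to a UI phase. A minimal sketch, using the listed 30s/45s/60s thresholds (the phase names here are illustrative, not the app's real state names):

```typescript
// Pure mapping from elapsed wait time to the wizard's timeout phase.
// Thresholds come from the feature list: 30s warning, 45s cancel
// option, 60s auto-timeout. Phase names are assumptions.
type WizardPhase = "waiting" | "slow-warning" | "cancel-offered" | "timed-out";

function wizardPhase(elapsedMs: number): WizardPhase {
  if (elapsedMs >= 60_000) return "timed-out";      // abort the request
  if (elapsedMs >= 45_000) return "cancel-offered"; // show a cancel button
  if (elapsedMs >= 30_000) return "slow-warning";   // "still working..."
  return "waiting";
}
```

Keeping this as a pure function makes the escalation trivially testable, with the actual timers living in the UI layer.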
## Template System
- Zod-validated template profiles with block, table, and list parsers
- Save templates to browser localStorage
- Download and import templates as `.json` files
- Browse and search community templates from GitHub
- Share templates to the community via pre-filled GitHub Issues
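To illustrate what template validation guards against, here is a hand-rolled sketch written without Zod so it stands alone. The field names are illustrative; the real Zod profile schema is richer, covering block, table, and list parser configuration.

```typescript
// Minimal validation sketch for an imported template file.
// Field names (`name`, `structure`, `datePattern`) are assumptions
// for illustration, not ScrapeGoat's actual schema.
interface TemplateProfile {
  name: string;
  structure: "block" | "table" | "list";
  datePattern: string; // regex source, expected to use named capture groups
}

function validateTemplate(raw: unknown): TemplateProfile {
  const t = raw as Partial<TemplateProfile>;
  if (typeof t?.name !== "string" || t.name.length === 0)
    throw new Error("template.name must be a non-empty string");
  if (t.structure !== "block" && t.structure !== "table" && t.structure !== "list")
    throw new Error("template.structure must be block | table | list");
  if (typeof t.datePattern !== "string")
    throw new Error("template.datePattern must be a string");
  new RegExp(t.datePattern); // throws SyntaxError on an invalid pattern
  return t as TemplateProfile;
}
```

Validating on import matters because community templates arrive as arbitrary JSON; rejecting malformed files early gives a clear error instead of a confusing parse failure later.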
## Powerful Parser Engine
- Three structure types: block-based, table-based, and list-based
- Named capture groups for flexible date extraction
- Ambiguous date detection (e.g., is 1/2/2026 January 2 or February 1?)
- Known-values scan with regex fallback for location and status
- Custom field extraction via per-field regex patterns
- Post-processing: deduplication, sorting, date logic validation
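Named capture groups and the ambiguity check above can be sketched together. The pattern and return shape here are illustrative, not the engine's actual code: a date is flagged ambiguous when both values fit in the 1-12 range and differ, so it would be valid under both MM/DD and DD/MM readings.

```typescript
// Named-capture-group date extraction with an ambiguity flag.
// The pattern and field names are illustrative assumptions.
const datePattern = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/;

function extractDate(
  line: string
): { month: number; day: number; year: number; ambiguous: boolean } | null {
  const m = datePattern.exec(line);
  if (!m?.groups) return null;
  const month = Number(m.groups.month);
  const day = Number(m.groups.day);
  const year = Number(m.groups.year);
  // 1/2/2026 could be Jan 2 or Feb 1: ambiguous whenever both values
  // are <= 12 and differ. 2/2/2026 reads the same either way.
  const ambiguous = month <= 12 && day <= 12 && month !== day;
  return { month, day, year, ambiguous };
}
```

Surfacing the `ambiguous` flag lets the review step ask the user once, instead of silently guessing the locale's date order.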
## Four Export Formats
- ICS — RFC 5545 compliant with VTIMEZONE, multi-phase support (Move-In/Event/Move-Out), line folding, STATUS mapping
- CSV — UTF-8 BOM for Excel, delimiter choice (comma/tab/semicolon), selectable columns
- JSON — matches internal event schema, null fields preserved, optional raw text
- Markdown — GFM table or list layout, human-readable dates, attribution footer
- Live preview before download
- All exports generated client-side
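The ICS line folding mentioned above comes from RFC 5545, which caps content lines at 75 octets and continues them on lines starting with a single space. A minimal sketch (this version counts UTF-16 code units for brevity; a production folder must count UTF-8 octets and avoid splitting inside a multi-byte character):

```typescript
// RFC 5545 line folding sketch: break long content lines at 75
// characters, prefixing each continuation line with one space.
// Counts code units, not octets -- a simplifying assumption.
function foldIcsLine(line: string, limit = 75): string {
  if (line.length <= limit) return line;
  const parts: string[] = [line.slice(0, limit)];
  // Continuation lines carry a leading space, leaving limit - 1
  // characters of payload per folded line.
  for (let i = limit; i < line.length; i += limit - 1) {
    parts.push(" " + line.slice(i, i + limit - 1));
  }
  return parts.join("\r\n");
}
```

Consumers unfold by deleting every `CRLF` followed by a space, so the round trip recovers the original content line exactly.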
## PWA & Offline Support
- Installable on mobile and desktop — add to home screen
- Service worker caches app shell for offline use
- Parsing and exporting work fully offline with saved templates
- Only the AI wizard requires an internet connection
- Offline detection hides the wizard when proxy is unreachable
## Accessibility & Responsive Design
- Keyboard navigable with visible focus indicators
- Screen reader support: ARIA labels, roles, live regions
- Focus trap in modal dialogs
- Skip-to-content link
- Dark/light mode with system preference detection
- Responsive breakpoints: mobile (<640px), tablet (640-1024px), desktop (>1024px)
- Mobile card layout fallback for data tables
Screenshots will be added after the first public deployment.
Visit scrapegoat.pages.dev — no install required.
```bash
# Clone the repo
git clone https://github.com/Jason-Vaughan/ScrapeGoat.git
cd ScrapeGoat

# Install dependencies
npm install

# Copy environment variables
cp .env.example .env

# Start the dev server
npm run dev
```

The app runs at http://localhost:5173. Parsing and exporting work without any API keys. To use the AI wizard locally, you'll need to set up the Cloudflare Worker — see the worker README or use the Turnstile test keys from `.env.example`.
```bash
# Unit tests
npm test

# E2E tests
npx playwright test
```

Community templates let you parse common schedule formats without running the AI wizard.
Templates are listed on the template selection screen after uploading a PDF. You can search by name, source, or tags.
Click "Use" next to any community template. ScrapeGoat fetches the template JSON from GitHub and applies it to your PDF.
- Create a template using the AI wizard
- Click "Share" on any saved template
- Copy the JSON and open the pre-filled GitHub Issue
- The community reviews and merges it into `templates/index.json`
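The pre-filled GitHub Issue step above can be sketched as URL construction. GitHub's `issues/new` endpoint accepts `title` and `body` query parameters; the title and body layout here is an illustrative guess, not ScrapeGoat's exact submission format.

```typescript
// Build a pre-filled GitHub Issue URL for sharing a template.
// The title/body wording is an assumption for illustration.
function templateIssueUrl(name: string, templateJson: string): string {
  const title = `Template submission: ${name}`;
  const body = "Template JSON (paste verbatim):\n\n" + templateJson;
  const params = new URLSearchParams({ title, body });
  return `https://github.com/Jason-Vaughan/ScrapeGoat/issues/new?${params}`;
}
```

`URLSearchParams` handles the percent-encoding, so template JSON containing `&`, `#`, or newlines survives the round trip intact.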
See CONTRIBUTING.md for full details on template contributions.
Your files never leave your device. ScrapeGoat runs entirely in your browser. The only external call is to Google's AI during initial template setup — and even that only sends extracted text, not your file.
- No user accounts — nothing to sign up for
- No cookies or tracking — no analytics, no telemetry
- No file uploads — PDF extraction happens client-side via PDF.js
- AI calls are minimal — only extracted text is sent, only during the one-time wizard flow
- Templates are local — saved in your browser's localStorage, never sent to a server
See CONTRIBUTING.md for details on code contributions, template contributions, bug reports, and development setup.
MIT — Made by Jason Vaughan
