# PDF Calendar Extractor

pwa · gemini ai · pdf.js · ics · csv · json · markdown · zero cost · privacy first
Beta Notice: ScrapeGoat is under active development and has not been fully tested in production. Expect rough edges. Bug reports and feedback are welcome via GitHub Issues.
You have a PDF schedule. You need calendar events. Existing tools don't work — they choke on weird date formats, multi-column layouts, and venue-specific quirks. ScrapeGoat fixes that.
Drop a PDF, answer a few multiple-choice questions, and get your events as ICS, CSV, JSON, or Markdown. Everything runs in your browser. Your files never leave your device.
- Drop a PDF — drag and drop any schedule into the browser
- Pick a template — use a saved template, browse community templates, or create a new one
- AI wizard — a guided multiple-choice interview builds a parsing template in about 2 minutes (no technical knowledge needed)
- Review events — check parsed results, flag issues, accept AI-suggested corrections
- Export — download as ICS, CSV, JSON, or Markdown with per-format options
## Client-Side PDF Extraction
- Powered by Mozilla's PDF.js — runs entirely in your browser
- Multi-page extraction with page break markers
- Image-only PDF detection with clear error messages
- Multi-column layout detection with warnings
- Drag-and-drop or file picker, max 50 MB
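The multi-column detection mentioned above can be sketched as a simple heuristic over PDF.js-style text items. This is an illustrative version, not ScrapeGoat's actual detector: the `TextItem` shape mirrors only the fields the heuristic needs, and the thresholds are made-up tuning values.

```typescript
// Hypothetical sketch: flag a likely multi-column layout by counting
// how many text items start in the left vs. right half of the page.
// Threshold values (0.45, 1.05, 0.3) are illustrative assumptions.
interface TextItem {
  str: string;
  x: number; // left edge of the item, in PDF units
}

function looksMultiColumn(items: TextItem[], pageWidth: number): boolean {
  if (items.length < 10) return false; // too little text to judge
  const mid = pageWidth / 2;
  let left = 0;
  let right = 0;
  for (const it of items) {
    if (it.x < mid * 0.45) left++;       // clearly in the left column
    else if (it.x > mid * 1.05) right++; // clearly in the right column
  }
  // If both halves hold a substantial share of line starts, extracted
  // text may interleave columns, so the app should warn the user.
  const total = items.length;
  return left / total > 0.3 && right / total > 0.3;
}
```

When the heuristic fires, the safest UX is a warning rather than a hard error, since some single-column PDFs have wide indentation that looks column-like.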
## AI-Powered Template Wizard
- Guided 6-step interview: document structure, date format, timezone, locations, status codes, event names
- Gemini 2.0 Flash powers the analysis via Cloudflare Worker proxy
- Turnstile bot protection — no API keys needed from users
- Correction flow: flag events, get AI alternatives, iterate up to 3 rounds
- AI-suggested template names
- Progressive timeout UX: 30s warning, 45s cancel option, 60s auto-timeout
- Graceful degradation: rate-limited or offline users can still use saved/community templates
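The progressive timeout UX above boils down to mapping elapsed time to a UI phase. A minimal sketch, using the listed 30s/45s/60s thresholds (the phase names here are illustrative, not the app's real state names):

```typescript
// Pure mapping from elapsed wait time to the wizard's timeout phase.
// Thresholds come from the feature list: 30s warning, 45s cancel
// option, 60s auto-timeout. Phase names are assumptions.
type WizardPhase = "waiting" | "slow-warning" | "cancel-offered" | "timed-out";

function wizardPhase(elapsedMs: number): WizardPhase {
  if (elapsedMs >= 60_000) return "timed-out";      // abort the request
  if (elapsedMs >= 45_000) return "cancel-offered"; // show a cancel button
  if (elapsedMs >= 30_000) return "slow-warning";   // "still working..."
  return "waiting";
}
```

Keeping this as a pure function makes the escalation trivially testable, with the actual timers living in the UI layer.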
## Template System
- Zod-validated template profiles with block, table, and list parsers
- Save templates to browser localStorage
- Download and import templates as `.json` files
- Browse and search community templates from GitHub
- Share templates to the community via pre-filled GitHub Issues
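To illustrate what template validation guards against, here is a hand-rolled sketch written without Zod so it stands alone. The field names are illustrative; the real Zod profile schema is richer, covering block, table, and list parser configuration.

```typescript
// Minimal validation sketch for an imported template file.
// Field names (`name`, `structure`, `datePattern`) are assumptions
// for illustration, not ScrapeGoat's actual schema.
interface TemplateProfile {
  name: string;
  structure: "block" | "table" | "list";
  datePattern: string; // regex source, expected to use named capture groups
}

function validateTemplate(raw: unknown): TemplateProfile {
  const t = raw as Partial<TemplateProfile>;
  if (typeof t?.name !== "string" || t.name.length === 0)
    throw new Error("template.name must be a non-empty string");
  if (t.structure !== "block" && t.structure !== "table" && t.structure !== "list")
    throw new Error("template.structure must be block | table | list");
  if (typeof t.datePattern !== "string")
    throw new Error("template.datePattern must be a string");
  new RegExp(t.datePattern); // throws SyntaxError on an invalid pattern
  return t as TemplateProfile;
}
```

Validating on import matters because community templates arrive as arbitrary JSON; rejecting malformed files early gives a clear error instead of a confusing parse failure later.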
## Powerful Parser Engine
- Three structure types: block-based, table-based, and list-based
- Named capture groups for flexible date extraction
- Ambiguous date detection (e.g., is 1/2/2026 January 2 or February 1?)
- Known-values scan with regex fallback for location and status
- Custom field extraction via per-field regex patterns
- Post-processing: deduplication, sorting, date logic validation
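Named capture groups and the ambiguity check above can be sketched together. The pattern and return shape here are illustrative, not the engine's actual code: a date is flagged ambiguous when both values fit in the 1-12 range and differ, so it would be valid under both MM/DD and DD/MM readings.

```typescript
// Named-capture-group date extraction with an ambiguity flag.
// The pattern and field names are illustrative assumptions.
const datePattern = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/;

function extractDate(
  line: string
): { month: number; day: number; year: number; ambiguous: boolean } | null {
  const m = datePattern.exec(line);
  if (!m?.groups) return null;
  const month = Number(m.groups.month);
  const day = Number(m.groups.day);
  const year = Number(m.groups.year);
  // 1/2/2026 could be Jan 2 or Feb 1: ambiguous whenever both values
  // are <= 12 and differ. 2/2/2026 reads the same either way.
  const ambiguous = month <= 12 && day <= 12 && month !== day;
  return { month, day, year, ambiguous };
}
```

Surfacing the `ambiguous` flag lets the review step ask the user once, instead of silently guessing the locale's date order.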
## Four Export Formats
- ICS — RFC 5545 compliant with VTIMEZONE, multi-phase support (Move-In/Event/Move-Out), line folding, STATUS mapping
- CSV — UTF-8 BOM for Excel, delimiter choice (comma/tab/semicolon), selectable columns
- JSON — matches internal event schema, null fields preserved, optional raw text
- Markdown — GFM table or list layout, human-readable dates, attribution footer
- Live preview before download
- All exports generated client-side
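The ICS line folding mentioned above comes from RFC 5545, which caps content lines at 75 octets and continues them on lines starting with a single space. A minimal sketch (this version counts UTF-16 code units for brevity; a production folder must count UTF-8 octets and avoid splitting inside a multi-byte character):

```typescript
// RFC 5545 line folding sketch: break long content lines at 75
// characters, prefixing each continuation line with one space.
// Counts code units, not octets -- a simplifying assumption.
function foldIcsLine(line: string, limit = 75): string {
  if (line.length <= limit) return line;
  const parts: string[] = [line.slice(0, limit)];
  // Continuation lines carry a leading space, leaving limit - 1
  // characters of payload per folded line.
  for (let i = limit; i < line.length; i += limit - 1) {
    parts.push(" " + line.slice(i, i + limit - 1));
  }
  return parts.join("\r\n");
}
```

Consumers unfold by deleting every `CRLF` followed by a space, so the round trip recovers the original content line exactly.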
## PWA & Offline Support
- Installable on mobile and desktop — add to home screen
- Service worker caches app shell for offline use
- Parsing and exporting work fully offline with saved templates
- Only the AI wizard requires an internet connection
- Offline detection hides the wizard when proxy is unreachable
## Accessibility & Responsive Design
- Keyboard navigable with visible focus indicators
- Screen reader support: ARIA labels, roles, live regions
- Focus trap in modal dialogs
- Skip-to-content link
- Dark/light mode with system preference detection
- Responsive breakpoints: mobile (<640px), tablet (640-1024px), desktop (>1024px)
- Mobile card layout fallback for data tables
Screenshots will be added after the first public deployment.
Visit scrapegoat.pages.dev — no install required.
```bash
# Clone the repo
git clone https://github.com/Jason-Vaughan/ScrapeGoat.git
cd ScrapeGoat

# Install dependencies
npm install

# Copy environment variables
cp .env.example .env

# Start the dev server
npm run dev
```

The app runs at http://localhost:5173. Parsing and exporting work without any API keys. To use the AI wizard locally, you'll need to set up the Cloudflare Worker — see the worker README or use the Turnstile test keys from `.env.example`.
```bash
# Unit tests
npm test

# E2E tests
npx playwright test
```

Community templates let you parse common schedule formats without running the AI wizard.
Templates are listed on the template selection screen after uploading a PDF. You can search by name, source, or tags.
Click "Use" next to any community template. ScrapeGoat fetches the template JSON from GitHub and applies it to your PDF.
- Create a template using the AI wizard
- Click "Share" on any saved template
- Copy the JSON and open the pre-filled GitHub Issue
- The community reviews and merges it into `templates/index.json`
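The pre-filled GitHub Issue step above can be sketched as URL construction. GitHub's `issues/new` endpoint accepts `title` and `body` query parameters; the title and body layout here is an illustrative guess, not ScrapeGoat's exact submission format.

```typescript
// Build a pre-filled GitHub Issue URL for sharing a template.
// The title/body wording is an assumption for illustration.
function templateIssueUrl(name: string, templateJson: string): string {
  const title = `Template submission: ${name}`;
  const body = "Template JSON (paste verbatim):\n\n" + templateJson;
  const params = new URLSearchParams({ title, body });
  return `https://github.com/Jason-Vaughan/ScrapeGoat/issues/new?${params}`;
}
```

`URLSearchParams` handles the percent-encoding, so template JSON containing `&`, `#`, or newlines survives the round trip intact.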
See CONTRIBUTING.md for full details on template contributions.
Your files never leave your device. ScrapeGoat runs entirely in your browser. The only external call is to Google's AI during initial template setup — and even that only sends extracted text, not your file.
- No user accounts — nothing to sign up for
- No cookies or tracking — no analytics, no telemetry
- No file uploads — PDF extraction happens client-side via PDF.js
- AI calls are minimal — only extracted text is sent, only during the one-time wizard flow
- Templates are local — saved in your browser's localStorage, never sent to a server
See CONTRIBUTING.md for details on code contributions, template contributions, bug reports, and development setup.
MIT — Made by Jason Vaughan
