Smart Elections Parser is a robust, modular, and integrity-focused precinct-level election result scraper and analyzer. It is designed to adapt to the ever-changing landscape of U.S. election reporting, supporting both traditional and modern web formats, and is built for extensibility, transparency, and auditability.
Consolidated documentation lives in the `docs/` folder: all ~50 markdown files have been organized into clear categories with cross-links.
| Purpose | Document | Read Time |
|---|---|---|
| 🚀 Getting Started | DEPLOYMENT_GUIDE.md | 10 min |
| 🏗️ Architecture & Design | architecture.md | 20 min |
| 📋 Quick Reference | QUICK_REFERENCES.md | Lookup |
| 🔐 Quarantine System | QUARANTINE_SYSTEM_GUIDE.md | 15 min |
| ✅ System Governance | SYSTEM_GOVERNANCE.md | 15 min |
| 🔍 Verification Framework | VERIFICATION_ARCHITECTURE.md | 20 min |
| 🔐 Certificate Auth | CERT_AUTH_IMPLEMENTATION.md | 15 min |
| ☁️ Azure Deployment | AZURE_DEPLOYMENT_CHECKLIST.md | 10 min |
```
docs/
├── architecture.md                  # Core system design & data flow
├── index.md                         # Main documentation hub
├── DEPLOYMENT_GUIDE.md              # Local/Docker/Azure setup
├── QUICK_REFERENCES.md              # API/CLI quick lookup
├── QUARANTINE_SYSTEM_GUIDE.md       # Transparent quarantine pipeline
├── SYSTEM_GOVERNANCE.md             # Ethical principles & privilege model
├── VERIFICATION_ARCHITECTURE.md     # Verification framework design
├── WAREHOUSE_VERIFICATION_GUIDE.md  # Gated warehouse verification
├── CERTAINTY_CAUTION_FRAMEWORK.md   # Confidence scoring system
├── CERT_AUTH_IMPLEMENTATION.md      # Certificate authentication
├── CERT_AUTH_REFERENCE.md           # Cert auth quick reference
├── CERT_AUTH_STEP5_CHECKLIST.md     # Phase 5 implementation tasks
├── AZURE_DEPLOYMENT_CHECKLIST.md    # Azure production checklist
├── ELECTION_OPERATIONS_PLAYBOOK.md  # Operational procedures
├── handlers.md                      # Handler architecture & routing
├── fec_fuzzy.md                     # FEC candidate fuzzy matching
├── session-logs/                    # Archived session reports (by date)
├── implementation-phases/           # Archived phase completion reports
├── implementation-history/          # Archived implementation records
└── archived/                        # Historical/deprecated docs
```
👨‍💻 Developers
- Start: architecture.md (overview of system layers)
- Reference: QUICK_REFERENCES.md (APIs and common tasks)
- Deep Dive: handlers.md (handler routing and patterns)
🚀 DevOps/Deployment
- Start: DEPLOYMENT_GUIDE.md (local → Docker → Azure)
- Reference: AZURE_DEPLOYMENT_CHECKLIST.md (production checklist)
- Troubleshooting: DEPLOYMENT_GUIDE.md
🔐 Security/Compliance
- Start: SYSTEM_GOVERNANCE.md (ethical principles)
- Deep Dive: VERIFICATION_ARCHITECTURE.md (verification framework)
- Operational: ELECTION_OPERATIONS_PLAYBOOK.md (procedures)
👥 Project Managers/Stakeholders
- Start: QUARANTINE_SYSTEM_GUIDE.md (transparency overview)
- Status: CERT_AUTH_IMPLEMENTATION.md (current phase status)
- Reference: CURRENT_SESSION_INDEX.md (work tracking)
- **Adaptive Navigation for Election Pages**
  - Autoscroll now tracks tables seen and stops when no new tables load, logging telemetry to tune timeouts.
  - Navigator consumes `navigation_keyword_bias.jsonl` plus new precinct/county recipes to open election tabs before scrolling.
  - HTML fallback prefers in-DOM table extraction before prompting for downloads when both are present.
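The stop-when-stable loop described above can be sketched like this; `count_tables` and `scroll_once` are hypothetical stand-ins for the navigator's real callbacks, not actual project APIs:

```python
from typing import Callable

def autoscroll_until_stable(count_tables: Callable[[], int],
                            scroll_once: Callable[[], None],
                            max_scrolls: int = 50,
                            patience: int = 3) -> int:
    """Scroll until no new tables appear for `patience` consecutive scrolls.

    Returns the final table count; callers can log (scrolls, count)
    telemetry afterwards to tune timeouts.
    """
    seen = count_tables()
    stalls = 0
    for _ in range(max_scrolls):
        scroll_once()
        now = count_tables()
        if now > seen:
            seen, stalls = now, 0  # new tables loaded; reset patience
        else:
            stalls += 1            # nothing new this scroll
            if stalls >= patience:
                break
    return seen
```

The `patience` threshold is what the logged telemetry would tune: too low and slow pages get cut off, too high and every run pays the timeout cost.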
- **HTML-First Parsing + Context Bridge**
  - HTML handling is the primary path for DOM-based election sites.
  - `context_organizer.py` maps DOM skeletons into context entries used by the router and handlers.
  - The bridge into format handlers is gated by available context and confirmed election signals.
- **Dynamic Table Extraction & Structure Learning**
  - Centralized in `table_core.py` and `dynamic_table_extractor.py`
  - Multi-strategy extraction: HTML tables, repeated DOM, pattern-based, ML/NLP, and plugin-based
  - Table structure learning, harmonization, and feedback are now fully centralized
  - ML/NER-powered entity annotation and structure verification
  - Dynamic scoring and patching: extraction methods are scored and can "fill in the blanks" using information from other strategies
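The scoring-and-patching idea can be sketched in miniature; the heuristics and row shape below are illustrative, not the actual `table_core.py` API:

```python
def score_table(table: list[dict]) -> float:
    """Heuristic score: reward filled cells and numeric-looking vote cells."""
    if not table:
        return 0.0
    cells = [v for row in table for v in row.values()]
    filled = sum(1 for v in cells if v not in (None, ""))
    numeric = sum(1 for v in cells if str(v).replace(",", "").isdigit())
    return filled / len(cells) + 0.5 * (numeric / len(cells))

def patch_missing(best: list[dict], donor: list[dict]) -> list[dict]:
    """Fill blank cells in the best extraction using a lower-scored strategy."""
    patched = []
    for row, alt in zip(best, donor):
        merged = dict(row)
        for key, val in merged.items():
            if val in (None, "") and alt.get(key):
                merged[key] = alt[key]
        patched.append(merged)
    return patched

def fuse(candidates: list[list[dict]]) -> list[dict]:
    """Pick the highest-scoring candidate, then patch blanks from the rest."""
    ranked = sorted(candidates, key=score_table, reverse=True)
    result = ranked[0]
    for donor in ranked[1:]:
        result = patch_missing(result, donor)
    return result
```

The real pipeline adds ML/NER signals to the score; the shape of the fusion step (rank, then borrow cells from runners-up) is the point here.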
- **Navigation Feedback Loop → Manual Correction**
  - Every navigation run logs per-step telemetry to `log/navigation_learning_log.jsonl` via `ContextCoordinator.record_navigation_feedback()`.
  - `webapp/parser/health/navigation_feedback_ingest.py` converts the log into `navigation_feedback_selection_log.jsonl`, so the manual correction bot can auto-review successes/failures, feed ML retraining, or fast-track new recipes without extra tooling.
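The ingest step amounts to a small JSONL-to-JSONL pass; the field names below (`url`, `step`, `success`) are assumptions about the telemetry schema, not the exact one:

```python
import json
from pathlib import Path

def ingest_navigation_log(src: Path, dst: Path) -> dict:
    """Summarize per-step telemetry into a selection log a review bot
    can consume. Each input line is one JSON object per navigation step."""
    tally = {"success": 0, "failure": 0}
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            step = json.loads(line)
            outcome = "success" if step.get("success") else "failure"
            tally[outcome] += 1
            # One selection-log record per step, ready for auto-review.
            fout.write(json.dumps({
                "url": step.get("url"),
                "step": step.get("step"),
                "outcome": outcome,
            }) + "\n")
    return tally
```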
- **Azure Health Control Center**
  - `/health_dashboard` surfaces an internal operations console: launch health tasks (manual correction, log/cache cleanup, dataset promotion), review logs, and audit system safeguards. Accessible when `ENABLE_HEALTH_TASKS=true`.
  - Each job streams stdout to the browser so you can supervise Azure deployments even when shell access is limited.
- **Context-Aware Orchestration**
  - `context_coordinator.py` and `context_organizer.py` orchestrate advanced context analysis, NLP, and ML integrity checks
  - Persistent context library (`context_library.json`) for learning from user feedback and corrections
  - Automated anomaly detection, clustering, and integrity checks (see `Integrity_check.py`)
- **Web UI & CLI Parity**
  - Flask-based web interface for managing URLs, running the parser, and reviewing output
  - Real-time log streaming via SocketIO
  - Data management dashboard for uploads, downloads, and URL hint management
  - Azure Health console for launching health scripts with live log streaming
  - Folder uploads are guarded by ingestion keys to prevent untrusted intake
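The ingestion-key guard could be as simple as a constant-time comparison; `INGESTION_KEY` is a hypothetical environment variable name used for illustration:

```python
import hmac
import os

def upload_allowed(presented_key: str) -> bool:
    """Reject folder uploads unless the caller presents the ingestion key.

    compare_digest avoids leaking the key's length or prefix through
    timing differences; missing configuration fails closed.
    """
    expected = os.environ.get("INGESTION_KEY", "")
    if not expected:
        return False  # no key configured: refuse all intake
    return hmac.compare_digest(presented_key.encode(), expected.encode())
```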
- **Handler Architecture**
  - Modular state/county/format handlers in `handlers/`
  - Handlers can delegate to county-level or format-level logic
  - Shared logic and utilities for contest selection, table extraction, and output formatting
- **Election Integrity & Transparency**
  - ML/NER-based anomaly detection and cross-field validation
  - Persistent logs and feedback loops for user corrections and audit trails
  - Manual correction bot and retraining pipeline for continuous improvement
  - All outputs are saved with rich metadata and context for reproducibility
- **Security & Compliance**
  - Path traversal and injection protections on all file/database operations
  - `.env`-driven configuration for all sensitive settings
  - Internal NLP/ML models replace external AI APIs for verification workflows
  - No credentials or session tokens are stored; the web UI can be secured for public deployment
- Single Source of Truth: All table extraction, harmonization, and feedback logic is centralized for maintainability and learning.
- Extensible & Pluggable: New extraction strategies, handlers, and ML models can be added without breaking the pipeline.
- Human-in-the-Loop: User feedback is integrated at every stage, from contest selection to table correction.
- Election Integrity First: Every step is logged, auditable, and designed to surface anomalies or suspicious data.
- Web & CLI Parity: All features are available via both the command line and the web interface.
- Multi-Strategy Table Extraction: HTML tables, repeated DOM, pattern-based, ML/NLP, plugin, and fallback NLP extraction.
- Dynamic Scoring & Patching: Extraction strategies are scored (ML/NER + heuristics); missing info is patched from other strategies when possible.
- Persistent Context Library: Learns from user corrections and feedback for smarter future extraction.
- Contest & Handler Routing: Dynamic state/county/format handler routing with fuzzy matching and context enrichment.
- Election Integrity Checks: ML/NER anomaly detection, cross-field validation, and audit logs.
- Web UI: Real-time log streaming, data management, and user-friendly contest/table review.
- Batch & Parallel Processing: Multiprocessing support for large-scale scraping.
- Security: Path safety, .env config, and no credential storage.
- Optimized PDF Parsing: pdf2image acceleration when Poppler is installed (automatic fallback to PyMuPDF otherwise).
- Headless or GUI Mode: Browser launches headlessly by default unless a CAPTCHA triggers human interaction.
- CAPTCHA-Resilient: Dynamically detects and pauses for Cloudflare verification with a visible browser.
- Race-Year Detection: Scans HTML to find available election years and contests.
- State-Aware Routing: Automatically detects state context and delegates to the appropriate handler module.
- Format-Aware Fallback: Supports CSV, JSON, PDF, and HTML formats with pluggable handlers.
- Output Sorting: Results saved in nested folders by state, county, and race.
- URL Selection: Loads URLs from `urls.txt` and lets users select specific targets.
- .env Driven: Easily override behavior such as CAPTCHA timeouts or headless preferences.
- Web UI Ready: All user prompts are modular for future web interface integration.
The Smart Elections Parser can be used in two ways:
- Standalone Python Script:
  - Run `html_election_parser.py` directly from your IDE or terminal for full CLI control.
  - No web server required.
- Web UI (Optional):
  - A modern Flask-based web interface is included for users who prefer a graphical experience or are new to coding.
- Key Features of the Web UI:
- Dashboard: Overview of the parser and quick access to all tools.
- URL Hint Manager: Add, edit, import/export, and validate custom URL-to-handler mappings.
- Change History: View and restore previous configurations for transparency and auditability.
- Run Parser: Trigger the parser from the browser and view real-time output in a styled terminal-like area.
- Live Feedback: See parser logs as they happen (via WebSockets).
- Azure Health Control Center: Queue manual correction, retraining, and log-cleanup scripts with live stdout streaming.
- Accessible: Designed for both technical and non-technical users, making it ideal for teams, researchers, and those learning to code.
- How to Use the Web UI:
  - Install requirements: `pip install -r requirements.txt`
    - Python 3.12 (Windows) tested combo: `pip install -r requirements.txt -c constraints/local-py312.txt`, then `python -m spacy download en_core_web_sm`
  - Set up your `.env` file (or set environment variables in your shell or IDE launch configuration):
    - Required variables include: `FLASK_SECRET_KEY`, `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `POSTGRES_HOST`, `POSTGRES_PORT`, `DATA_API_URL`, `CSP_MODE`
    - You can copy `.env.template` to `.env` and fill in your values.
    - For local development, install python-dotenv to automatically load variables from `.env`: `pip install python-dotenv`
      - Note: `python-dotenv` is not included in `requirements.txt` and is not needed in production or on Azure.
  - Start the web server: `python -m webapp.Smart_Elections_Parser_Webapp`
  - Open your browser to `http://localhost:5000`
- Note: If you run `python -m webapp.Smart_Elections_Parser_Webapp` directly, you must ensure all required environment variables are set, or the app will not start.
- The web UI is optional; all core parser features remain available via the CLI.
Before running the web server, you must set the required environment variables. You can do this in several ways:
Option 1: Use a `.env` file (recommended for local development):
- Copy `.env.template` to `.env` and fill in your values.
- Install `python-dotenv` locally: `pip install python-dotenv`
- The app will automatically load variables from `.env` if present.
Option 2: Set environment variables in your shell before running (Windows Command Prompt):
```cmd
set FLASK_SECRET_KEY=yourkey
set POSTGRES_USER=postgres
set POSTGRES_PASSWORD=yourpassword
set POSTGRES_DB=warehouse_election_results
set POSTGRES_HOST=localhost
set POSTGRES_PORT=5432
set DATA_API_URL=/api/warehouse_election_results
set CSP_MODE=STRICT
python -m webapp.Smart_Elections_Parser_Webapp
```

Or, on Linux/macOS:
```bash
export FLASK_SECRET_KEY=yourkey
export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=yourpassword
export POSTGRES_DB=warehouse_election_results
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export DATA_API_URL=/api/warehouse_election_results
export CSP_MODE=STRICT
python -m webapp.Smart_Elections_Parser_Webapp
```

Alternatively, you can set these variables in your IDE launch configuration.
Production (Azure):
- Set environment variables in the Azure App Service configuration panel.
- Do not include `.env` or `python-dotenv` in your production deployment.
- State/County Handler:
  - Create a new handler in `handlers/states/` or `handlers/counties/`.
  - Implement a `parse(page, html_context)` function.
  - Register your handler in `state_router.py`.
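A skeleton for a new state handler might look like the following; the return shape and the `html_context` keys are assumptions about the interface sketched for illustration, not the exact contract:

```python
# handlers/states/example_state.py — hypothetical skeleton

def parse(page, html_context):
    """Entry point invoked by state_router.py once this state is detected.

    `page` is the live browser page; `html_context` carries hints gathered
    upstream (URL, detected year, contest labels, and so on).
    """
    results = []
    for contest in html_context.get("contests", []):
        results.append({
            "state": "example_state",
            "contest": contest,
            "rows": [],  # delegate to the shared table-extraction utilities here
        })
    return results
```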
- Custom Noisy Labels/Patterns:
  - In your handler, pass `noisy_labels` and `noisy_label_patterns` to `select_contest()` for contest filtering.
- Format Handler:
  - Add your handler to `utils/format_router.py` and register it in `route_format_handler`.
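Registration boils down to mapping a format key to a handler callable; this toy version is a sketch of the pattern, not the actual `utils/format_router.py` code:

```python
# Toy registry mirroring the route_format_handler pattern.
_FORMAT_HANDLERS = {}

def register(fmt):
    """Decorator that registers a handler under a format key."""
    def wrap(handler):
        _FORMAT_HANDLERS[fmt] = handler
        return handler
    return wrap

@register("csv")
def handle_csv(path):
    return f"parsed {path} as csv"

def route_format_handler(path):
    """Dispatch a file to its format handler based on extension."""
    ext = path.rsplit(".", 1)[-1].lower()
    handler = _FORMAT_HANDLERS.get(ext)
    if handler is None:
        raise ValueError(f"no handler registered for .{ext}")
    return handler(path)
```

With this shape, adding a new format is one decorated function; no dispatch code changes.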
- User Prompts:
  - Use `prompt_user_input()` for all user input to allow easy web UI integration later.
  - Example:

    ```python
    from utils.user_prompt import prompt_user_input

    url = prompt_user_input("Enter URL: ")
    ```
```
html_Parser_prototype/
├── webapp/
│ ├── Smart_Elections_Parser_Webapp.py # Flask web UI
│ ├── parser/
│ │ ├── html_election_parser.py # Main CLI orchestrator
│ │ ├── state_router.py # Dynamic handler routing
│ │ ├── utils/
│ │ │ ├── table_core.py # Centralized table extraction/learning
│ │ │ ├── dynamic_table_extractor.py # Candidate table generator/scorer
│ │ │ ├── ml_table_detector.py # ML/NLP table detection
│ │ │ ├── shared_logger.py # Logging utilities
│ │ │ ├── user_prompt.py # CLI/web prompt utilities
│ │ │ └── ... # (browser, captcha, etc.)
│ │ ├── Context_Integration/
│ │ │ ├── context_coordinator.py # Context/NLP/ML orchestrator
│ │ │ ├── context_organizer.py # Context enrichment, clustering, DB
│ │ │ └── Integrity_check.py # Election integrity/anomaly checks
│ │ ├── handlers/
│ │ │ ├── states/ # State/county handlers
│ │ │ ├── formats/ # Format handlers (csv, pdf, json, html)
│ │ │ └── shared/ # Shared handler logic
│ │ ├── templates/ # Web UI templates
│ │ ├── input/ # Input data
│ │ ├── output/ # Output data
│ │ ├── log/ # Logs
│ │ ├── .env
│ │ ├── .env.template
│ │ └── requirements.txt
```
---
## 🧪 How to Use
### Install Requirements
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
### 🤖 Run Automated Scripts
For comprehensive automation including pipeline audits, health checks, web asset validation, and testing:
```bash
python automate.py  # Run all automated tasks
```

Options:
- `--skip-web`: Skip web asset checks (JS/CSS/HTML linting)
- `--skip-health`: Skip health bots and integrity checks
- `--skip-tests`: Skip automated tests
- `--skip-webapp-check`: Skip webapp import validation
Note: When running on localhost (default), warnings are automatically suppressed for cleaner output. This includes deprecation warnings, future warnings, and pending deprecation warnings. The system detects localhost by checking if POSTGRES_HOST is set to localhost or 127.0.0.1, or if FLASK_ENV is set to development.
This central script ensures the project stays healthy and up-to-date.
- Windows (local development):
  - Download the latest Poppler build from https://github.com/oschwartz10612/poppler-windows/releases and unzip it (for example to `C:\poppler`).
  - Set the environment variable so the parser can find the binaries: `setx POPPLER_PATH "C:\poppler\Library\bin"`
  - Restart any running parser/webapp processes so the change takes effect.
- Linux / Azure (production):
  - Install Poppler utilities during provisioning or container build: `sudo apt-get update && sudo apt-get install -y poppler-utils`
  - The handler automatically detects `pdftoppm`/`pdftocairo` once they are on PATH.
- Verification: rerun a PDF-heavy sample (for example the Minnesota 2016 PDF) and confirm the logs no longer emit `pdf2image conversion failed` messages.
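Detection of the Poppler binaries can be sketched with `shutil.which`, mirroring the fallback decision described above; `POPPLER_PATH` is the same variable set in the Windows steps, and the function name is illustrative:

```python
import os
import shutil

def poppler_available() -> bool:
    """True when pdftoppm is reachable, either on PATH or via POPPLER_PATH.

    When this returns False, a caller would fall back to PyMuPDF instead
    of attempting pdf2image conversion.
    """
    if shutil.which("pdftoppm"):
        return True
    custom = os.environ.get("POPPLER_PATH")
    # shutil.which accepts an explicit search path for non-PATH installs.
    return bool(custom and shutil.which("pdftoppm", path=custom))
```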
- Populate `urls.txt` with target election result URLs.
- Update `state_router.py` when dynamic state detection fails.
- Start the web server first (this script activates the PostgreSQL database, so it must be run before the parser). From the project root (`cd ...full path...\html_Parser_prototype` if your terminal is elsewhere):
  `python -m webapp.Smart_Elections_Parser_Webapp`
- Then visit http://localhost:5000 in your browser, or paste the IP address printed to the terminal into your browser of choice.
- Run the CLI parser, also from the project root:
  `python -m webapp.parser.html_election_parser`
All parsed results are saved in a structured, transparent, and auditable format:

`output/{state}/{county}/{race}/{contest}_results.csv`

Example:

`output/arizona/maricopa/us_senate/kari_lake_results.csv`
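The nesting scheme can be sketched as a small path builder; the slugging rules here are an assumption made so the example matches the paths above:

```python
import re
from pathlib import Path

def output_path(state: str, county: str, race: str, contest: str) -> Path:
    """Build output/{state}/{county}/{race}/{contest}_results.csv."""
    def slug(text: str) -> str:
        # Lowercase, collapse runs of non-alphanumerics into underscores.
        return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return Path("output", slug(state), slug(county), slug(race),
                f"{slug(contest)}_results.csv")
```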
For each contest, the following files are generated:
- CSV Results: Tabular results for the contest, ready for analysis.
- Metadata JSON: Includes key information such as `state`, `county`, `year`, `race`, `contest`, `handler`, `timestamp`, and additional extraction context.
- Audit Trail: A detailed log of extraction steps, harmonization, user corrections, and any anomalies detected, ensuring full transparency and reproducibility.
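Writing the metadata sidecar might look like this; the field list comes from the bullets above, but the exact schema and sidecar file naming are assumptions:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_metadata(csv_path: Path, state, county, year, race, contest,
                   handler, **extra) -> Path:
    """Save a metadata JSON next to the contest CSV for reproducibility."""
    meta = {
        "state": state, "county": county, "year": year,
        "race": race, "contest": contest, "handler": handler,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **extra,  # additional extraction context (source URL, strategy, ...)
    }
    out = csv_path.with_suffix(".metadata.json")
    out.write_text(json.dumps(meta, indent=2))
    return out
```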
- Add New Extraction Strategies: Implement in `table_core.py` or as a plugin.
- Add Handlers: Place new state/county/format handlers in `handlers/`.
- Election Integrity: All new logic should log decisions and support auditability.
- Add New States: Create a new file in `handlers/states/` (e.g. `georgia.py`) and implement a `parse()` method.
- Add Format Support: Add a new file in `handlers/formats/` and map it in `format_router.py`.
- Shared Behavior: Use `utils/shared_logic.py` for common race detection, total extraction, etc.
- Headless Scraping: All scraping runs headlessly by default; a visible browser is launched only if CAPTCHA is triggered.
- .env Protection: Sensitive settings are managed via `.env`, which is excluded from version control (`.gitignore`).
- No Credential Storage: No credentials or session tokens are stored at any time.
- Path Safety: All file and database operations are path-safe and `.env`-configured to prevent injection or traversal attacks.
- Web UI Security: The web interface can be protected with authentication when deployed publicly.
- Auditability: All user feedback and corrections are logged for transparency and audit trails.
- Election Integrity: ML/NER-powered anomaly detection, cross-field validation, and persistent logs enforce data integrity.
- Multi-race selection prompt
- Retry logic for failed URLs
- Browser fingerprint obfuscation
- Contributor upload queue (for handler patches)
- YAML config option for handler metadata
- Web UI for user prompts and batch management
Smart Elections Parser is built to set a new standard for election data integrity and transparency. Every extraction, correction, and output is:
- Auditable: Full logs and metadata for every step.
- Verifiable: ML/NER-powered anomaly detection and structure validation.
- Correctable: Human-in-the-loop feedback at every stage.
- Extensible: Ready for new formats, handlers, and AI/ML improvements.
- Secure: Designed for safe, compliant, and transparent operation.
MIT License (TBD)
- Lead Dev: Juancarlos Barragan
- Elections Research: TBD
- Format Extraction: TBD