A flexible, extensible job scraping framework built with Python that can scrape job listings from multiple job boards. It currently supports the Workable platform, with an architecture designed for easy extension to other job platforms.
- Multi-Platform Support: Extensible architecture for adding new job board scrapers
- Smart Filtering: Configurable job filtering by keywords, location, job type, and company
- Database Storage: SQLite database for persistent job data storage
- Duplicate Detection: Automatic detection and prevention of duplicate job entries
- Respectful Scraping: Built-in delays between requests to avoid overloading target sites
- Factory Pattern: Dynamic scraper creation and registration system
- Orchestrated Scraping: Centralized job scraping orchestration with statistics
job-board-data/
├── core/ # Core framework components
│ ├── __init__.py
│ ├── base_scraper.py # Abstract base class for platform scrapers
│ ├── database.py # Database operations and management
│ ├── factory.py # Dynamic scraper factory
│ ├── filters.py # Job filtering logic
│ └── orchestrator.py # Main scraping orchestration
├── main.py # Main application entry point
├── workable_scraper.py # Workable platform scraper implementation
├── workable.py # Legacy Workable scraper
└── README.md
- Clone the repository:
  git clone <repository-url>
  cd job-board-data
- Install required dependencies (sqlite3 is part of the Python standard library and does not need to be installed with pip):
  pip install selenium
- Install ChromeDriver for Selenium (required for web scraping):
  - Download it from the ChromeDriver site
  - Add it to your system PATH
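
Before running the scraper, it can help to confirm that a ChromeDriver binary is reachable on PATH. The following is a minimal sketch using only the standard library. The helper name `find_chromedriver` is hypothetical and not part of this project.

```python
import shutil
from typing import Optional


def find_chromedriver(binary_name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the ChromeDriver binary found on PATH, or None.

    Hypothetical helper for illustration; the project itself may locate the
    driver differently.
    """
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("ChromeDriver was not found on PATH")
    else:
        print(f"ChromeDriver found at {path}")
```

This only checks that the binary can be found; it does not verify that its version matches the installed Chrome browser.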
from main import scrape_singl_platform

# Basic scraping example
search_params = {
    'query': 'python developer',
    'location': 'remote',
    'company_url': 'https://company.workable.com'
}

# Scrape jobs from Workable
scrape_singl_platform('workable', search_params, max_pages=3)

List available platforms:
python main.py list

Run with default parameters:
python main.py

# Define filter criteria
filter_params = {
    'keywords': ['python', 'developer', 'engineer'],
    'locations': ['remote', 'new york', 'san francisco'],
    'job_types': ['full-time'],
    'companies': ['tech', 'startup']
}

# Scrape with filters
scrape_singl_platform('workable', search_params, filter_params, max_pages=5)

from main import scrape_multiple_platforms

platforms = ['workable']  # Add more platforms as they become available
scrape_multiple_platforms(platforms, search_params, filter_params)

- BasePlatformScraper (core/base_scraper.py):
  - Abstract base class defining the scraper interface
  - Methods for driver setup, page navigation, data extraction
- JobScrapperOrchestrator (core/orchestrator.py):
  - Coordinates the scraping process
  - Handles filtering, duplicate detection, and database operations
  - Provides scraping statistics and error handling
- DatabaseManager (core/database.py):
  - SQLite database operations
  - Job storage and duplicate detection
  - Database schema management
- JobFilter (core/filters.py):
  - Configurable job filtering system
  - Filter by keywords, location, job type, salary, company
- ScrapperFactory (core/factory.py):
  - Dynamic scraper creation and registration
  - Plugin-style architecture for adding new platforms
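
The registration mechanism described for the factory can be illustrated with a small registry. This is a minimal sketch of the general factory pattern, not the code in core/factory.py. The names `SketchScraperFactory`, `create_scraper`, `available_platforms`, and `DummyScraper` are hypothetical.

```python
from typing import Callable, Dict, List


class SketchScraperFactory:
    """Illustrative registry of scraper classes keyed by platform name.

    Hypothetical stand-in; the project's ScrapperFactory may differ.
    """

    _registry: Dict[str, Callable[[], object]] = {}

    @classmethod
    def register_scraper(cls, name: str, scraper_cls: Callable[[], object]) -> None:
        # Store names in lowercase so lookups are case-insensitive.
        cls._registry[name.lower()] = scraper_cls

    @classmethod
    def create_scraper(cls, name: str) -> object:
        try:
            scraper_cls = cls._registry[name.lower()]
        except KeyError:
            raise ValueError(f"Unknown platform: {name!r}") from None
        # Instantiate a fresh scraper on every call.
        return scraper_cls()

    @classmethod
    def available_platforms(cls) -> List[str]:
        return sorted(cls._registry)


class DummyScraper:
    """Placeholder scraper used only to demonstrate registration."""

    platform_name = "dummy"


SketchScraperFactory.register_scraper("dummy", DummyScraper)
scraper = SketchScraperFactory.create_scraper("Dummy")
print(type(scraper).__name__, SketchScraperFactory.available_platforms())
```

Keeping the registry in one place means that adding a platform only requires a new class and a single registration call, which is the plugin-style behavior described above.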
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    platform TEXT,
    job_title TEXT,
    company TEXT,
    location TEXT,
    job_type TEXT,
    salary TEXT,
    description TEXT,
    requirements TEXT,
    post_date TEXT,
    url TEXT,
    company_logo TEXT,
    scrapped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    raw_data TEXT
)

- Scrapes job listings from Workable-powered career pages
- Supports detailed job information extraction
- Handles pagination and dynamic content loading
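
Scraped listings are stored in the jobs table and checked for duplicates before insertion. The sketch below shows one way that check could work against an in-memory SQLite database. It assumes a duplicate is a row with the same platform and url, which the README does not specify, and it uses a trimmed subset of the columns. The function name `save_job_if_new` is hypothetical.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        platform TEXT,
        job_title TEXT,
        company TEXT,
        url TEXT
    )
    """
)


def save_job_if_new(connection: sqlite3.Connection, job: dict) -> bool:
    """Insert the job unless a row with the same platform and url exists.

    Returns True when a row was inserted and False for a duplicate.
    Assumption: (platform, url) identifies a job; the real project may
    use a different rule.
    """
    existing = connection.execute(
        "SELECT 1 FROM jobs WHERE platform = ? AND url = ? LIMIT 1",
        (job["platform"], job["url"]),
    ).fetchone()
    if existing is not None:
        return False
    connection.execute(
        "INSERT INTO jobs (platform, job_title, company, url) VALUES (?, ?, ?, ?)",
        (job["platform"], job["job_title"], job["company"], job["url"]),
    )
    connection.commit()
    return True


example_job = {
    "platform": "workable",
    "job_title": "Python Developer",
    "company": "Example Co",
    "url": "https://example.invalid/jobs/1",
}
first_attempt = save_job_if_new(conn, example_job)
second_attempt = save_job_if_new(conn, example_job)
```

A UNIQUE constraint on the chosen columns would enforce the same rule at the database level; the explicit lookup is used here only to keep the logic visible.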
To add a new job board scraper:
- Create a new scraper class inheriting from BasePlatformScraper:

from core.base_scraper import BasePlatformScraper


class NewPlatformScraper(BasePlatformScraper):
    def __init__(self):
        super().__init__("new_platform")

    def setup_driver(self):
        # Implementation for driver setup
        pass

    def get_job_listings_page(self, search_params):
        # Implementation for navigation
        pass

    # Implement other required methods...

- Register the scraper with the factory:
from core.factory import ScrapperFactory

ScrapperFactory.register_scraper("new_platform", NewPlatformScraper)

Search parameters (search_params):
- query: Job search query/keywords
- location: Job location filter
- company_url: Platform-specific company URL (for Workable)

Filter parameters (filter_params):
- keywords: List of required keywords in title/description
- locations: List of acceptable job locations
- job_types: List of acceptable job types (full-time, part-time, etc.)
- companies: List of acceptable company names
- salary_min: Minimum salary requirement
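
To make the filter parameters concrete, here is a minimal sketch of how a single job could be checked against them. It is not the code in core/filters.py. The function `job_matches` is hypothetical, and it assumes a job passes a criterion when at least one listed value appears as a case-insensitive substring. The salary_min check is omitted because the salary column is free text.

```python
from typing import Dict, List


def _contains_any(text: str, needles: List[str]) -> bool:
    """Return True if any needle occurs in text, ignoring case."""
    lowered = text.lower()
    return any(needle.lower() in lowered for needle in needles)


def job_matches(job: Dict[str, str], filter_params: Dict[str, List[str]]) -> bool:
    """Check a job dict against keyword, location, job type, and company lists.

    Assumption: matching is "any value matches" per criterion, and all
    supplied criteria must pass.
    """
    checks = [
        ("keywords", f"{job.get('job_title', '')} {job.get('description', '')}"),
        ("locations", job.get("location", "")),
        ("job_types", job.get("job_type", "")),
        ("companies", job.get("company", "")),
    ]
    for key, text in checks:
        wanted = filter_params.get(key)
        # A missing or empty list means the criterion is not applied.
        if wanted and not _contains_any(text, wanted):
            return False
    return True


sample_job = {
    "job_title": "Senior Python Developer",
    "description": "Build scraping pipelines",
    "location": "Remote",
    "job_type": "Full-time",
    "company": "Example Tech",
}
print(job_matches(sample_job, {"keywords": ["python"], "locations": ["remote"]}))
```

Substring matching is deliberately loose, so a value such as 'tech' in companies matches any company name containing it.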
The scraper provides detailed statistics:
- Total jobs found
- Jobs filtered out
- Duplicate jobs detected
- Successfully scraped jobs
- Errors encountered
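
The counters listed above could be tracked with a small data class. This is a sketch only; the class name `ScrapeStats` and its field names are hypothetical and may not match what the orchestrator actually reports.

```python
from dataclasses import dataclass


@dataclass
class ScrapeStats:
    """Hypothetical container for the statistics listed in the README."""

    total_found: int = 0
    filtered_out: int = 0
    duplicates: int = 0
    saved: int = 0
    errors: int = 0

    def record(self, outcome: str) -> None:
        """Count one job and increment the counter named by outcome."""
        if outcome not in ("filtered_out", "duplicates", "saved", "errors"):
            raise ValueError(f"Unknown outcome: {outcome!r}")
        self.total_found += 1
        setattr(self, outcome, getattr(self, outcome) + 1)

    def summary(self) -> str:
        return (
            f"found={self.total_found} filtered={self.filtered_out} "
            f"duplicates={self.duplicates} saved={self.saved} errors={self.errors}"
        )


stats = ScrapeStats()
for outcome in ["saved", "filtered_out", "duplicates", "saved"]:
    stats.record(outcome)
print(stats.summary())
```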
- Respectful Scraping: Built-in delays prevent overwhelming target servers
- Error Handling: Comprehensive error handling and logging
- Data Validation: Input validation and data sanitization
- Modular Design: Easy to extend and maintain
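
As an illustration of the delay-based approach to respectful scraping, the sketch below waits a random interval between requests. The function name `polite_delay` and the default bounds are assumptions chosen for the example, not values taken from this project.

```python
import random
import time
from typing import Callable


def polite_delay(
    min_seconds: float = 1.0,
    max_seconds: float = 3.0,
    sleep: Callable[[float], object] = time.sleep,
) -> float:
    """Sleep for a random duration between the two bounds and return it.

    The sleep function is injectable so the helper can be exercised
    without actually waiting.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("Require 0 <= min_seconds <= max_seconds")
    # A randomized pause avoids hitting the server at a fixed rhythm.
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    waited = polite_delay(0.0, 0.01)
    print(f"Waited {waited:.4f} seconds before the next request")
```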
- Fork the repository
- Create a feature branch
- Implement your changes following the existing architecture
- Add tests for new functionality
- Submit a pull request
This project is for educational and legitimate job search purposes only. Please respect the terms of service of job board websites and use responsibly.
- ChromeDriver not found:
  - Ensure ChromeDriver is installed and in your PATH
  - Check ChromeDriver version compatibility with your Chrome browser
- Database errors:
  - Check file permissions for database creation
  - Ensure SQLite is properly installed
- Scraping failures:
  - Website structure may have changed
  - Check for CAPTCHA or anti-bot measures
  - Verify network connectivity
For more detail when troubleshooting a specific issue, enable debug output by adding more verbose logging to the scraper.
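
One way to get more verbose output is Python's standard logging module. The sketch below is a suggestion rather than something the project already provides. The helper `enable_debug_logging` and the logger name "job_scraper" are hypothetical.

```python
import logging


def enable_debug_logging(logger_name: str = "job_scraper") -> logging.Logger:
    """Return a logger set to DEBUG that writes timestamped lines to stderr."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Attach a handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = enable_debug_logging()
    log.debug("Loading job listings page")
```

Calls such as `log.debug(...)` would then need to be added at the points in the scraper where more detail is wanted.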