A coffee database system for Rotterdam's specialty coffee roasters with web scraping, NLP extraction, and analytics.
This project scrapes coffee data from local Rotterdam roasters and provides:
- Multi-roaster analytics across the coffee scene
- Sustainability scoring based on transparency and trade practices
- Flavor matching using NLP
- Price tracking and market analysis
- Quality indicators from championships and certifications
- Working scrapers: 7 roasters (Schot, Ripsnorter, Manhattan, Giraffe, Spicekix, Manmetbril, Grounded)
- Data extracted: 72 coffee beans across 14 countries
- Price range: €4.50 - €235.44 (commercial to ultra-premium)
- Architecture: Modular structure with clean separation
# Install dependencies
poetry install
# or
pip install -r requirements.txt
# Show existing data
poetry run python -m src.cli.main
# Update data
poetry run python -m src.cli.main --update
# View scoring results
poetry run python -m src.cli.main --scores-only

coffee-database/
├── src/ # Source code
│ ├── cli/ # Command-line interface
│ ├── models/ # Data models (Bean, Roaster)
│ ├── scrapers/ # Web scrapers for 7 roasters
│ ├── extract/ # NLP extraction pipeline
│ ├── scoring/ # Scoring algorithms
│ └── db/ # Data persistence
├── tests/ # Test suite
├── data/ # JSON data files and reports
├── docs/ # Documentation
└── scripts/ # Migration and utility scripts
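The `scoring/` component combines transparency and trade-practice indicators into a sustainability score. The actual weights and criteria are not documented here; the sketch below only illustrates the general shape of such a scorer, with hypothetical inputs and weights.

```python
# Hypothetical sustainability scorer. The indicators (farm named, price
# disclosed, direct trade, certifications) and all weights are assumptions
# illustrating the approach, not the project's real algorithm.
def sustainability_score(
    farm_named: bool,
    price_paid_disclosed: bool,
    direct_trade: bool,
    certifications: int,
) -> float:
    score = 0.0
    score += 30 if farm_named else 0            # traceability to farm level
    score += 30 if price_paid_disclosed else 0  # price transparency
    score += 25 if direct_trade else 0          # trade practice
    score += min(certifications, 3) * 5         # certification bonus, capped at 15
    return score

print(sustainability_score(True, True, False, 2))  # 70.0
```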
- Modular design: Each roaster has its own scraper class
- Base class: Abstract `BaseScraper` for consistency
- Multiple strategies: DOM parsing, regex, hybrid approaches
- Rate limiting: Respectful scraping with delays
- Smart origin detection: Countries, regions, and farms
- Flavor analysis: Categorized tasting notes with intensity
- Processing classification: Natural, washed, anaerobic, etc.
- Quality indicators: Championships, certifications, specialty scores
- Repository pattern: Clean data persistence
- Pydantic models: Type-safe data validation
- CLI interface: Easy command-line usage
- Comprehensive tests: Unit and integration testing
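The flavor-analysis step groups raw tasting notes into broader categories. A simplified keyword-based stand-in for the `extract/` pipeline might look like this; the category names and keyword lists are assumptions, and the real pipeline's NLP is more involved.

```python
# Simplified keyword-based flavor categorization. Categories and keywords
# are illustrative assumptions, not the project's actual taxonomy.
FLAVOR_CATEGORIES = {
    "fruity": {"berry", "citrus", "apple", "stone fruit", "tropical"},
    "floral": {"jasmine", "rose", "floral", "lavender"},
    "sweet": {"caramel", "honey", "chocolate", "vanilla"},
    "nutty": {"almond", "hazelnut", "peanut"},
}

def categorize_notes(notes: list[str]) -> dict[str, list[str]]:
    """Group raw tasting notes under broad flavor categories."""
    result: dict[str, list[str]] = {}
    for note in notes:
        for category, keywords in FLAVOR_CATEGORIES.items():
            if note.lower() in keywords:
                result.setdefault(category, []).append(note)
    return result

print(categorize_notes(["Citrus", "caramel", "jasmine"]))
# {'fruity': ['Citrus'], 'sweet': ['caramel'], 'floral': ['jasmine']}
```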
# Run all tests
poetry run pytest tests/ -v
# Test specific components
poetry run pytest tests/unit/test_scrapers/ -v
poetry run pytest tests/unit/test_models/ -v
# Integration tests
poetry run pytest tests/integration/ -v

Adding a new roaster:
- Create a new scraper class extending `BaseScraper`
- Implement `get_roaster_data()` and `extract_bean_info()`
- Add to the scraper registry
- Write tests
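The steps above can be sketched as follows. The real `BaseScraper` and registry live in `src/scrapers/`; this stub only mirrors the documented method names (`get_roaster_data()`, `extract_bean_info()`), and the delay value, roaster name, and dict-shaped registry are assumptions.

```python
from abc import ABC, abstractmethod
import time

# Minimal stand-in for the project's BaseScraper. Only the two documented
# abstract methods are mirrored; delay value and registry shape are assumed.
class BaseScraper(ABC):
    delay_seconds = 1.0  # rate limiting between requests (assumed default)

    @abstractmethod
    def get_roaster_data(self) -> dict: ...

    @abstractmethod
    def extract_bean_info(self, html: str) -> list[dict]: ...

    def throttle(self) -> None:
        """Sleep between requests for respectful scraping."""
        time.sleep(self.delay_seconds)

class ExampleRoasterScraper(BaseScraper):
    """Hypothetical scraper for a new roaster."""

    def get_roaster_data(self) -> dict:
        return {"name": "Example Roaster", "city": "Rotterdam"}

    def extract_bean_info(self, html: str) -> list[dict]:
        # Real scrapers parse the DOM or apply regexes here; stubbed out.
        return [{"name": "Example Lot", "price_eur": 12.50}]

# Register the new scraper (assumed dict-based registry)
SCRAPER_REGISTRY = {"example": ExampleRoasterScraper}
```

Because `BaseScraper` is abstract, forgetting to implement either method raises a `TypeError` at instantiation time rather than failing silently mid-scrape.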
The codebase is set up for easy extension:
- Clean separation of concerns
- Type hints throughout
- Comprehensive test coverage
- Well-documented APIs
Check docs/ for detailed guides on database migration, deployment, and feature roadmaps.
- Database integration (PostgreSQL/SQLite)
- Web interface dashboard
- Additional roaster support
- Real-time update scheduling
- Machine learning recommendations
MIT License - see LICENSE file for details.
Status: Production ready with 6/7 scrapers working
Last Updated: January 2025