mseijse01/coffee-database
Coffee Database - Rotterdam Coffee Scene

A coffee database system for Rotterdam's specialty coffee roasters with web scraping, NLP extraction, and analytics.

What it does

This project scrapes coffee data from local Rotterdam roasters and provides:

  • Multi-roaster analytics across the coffee scene
  • Sustainability scoring based on transparency and trade practices
  • Flavor matching using NLP
  • Price tracking and market analysis
  • Quality indicators from championships and certifications

Current Status

  • Working scrapers: 7 roasters (Schot, Ripsnorter, Manhattan, Giraffe, Spicekix, Manmetbril, Grounded)
  • Data extracted: 72 coffee beans across 14 countries
  • Price range: €4.50 - €235.44 (commercial to ultra-premium)
  • Architecture: modular structure with clean separation of concerns

Quick Start

# Install dependencies
poetry install
# or
pip install -r requirements.txt

# Show existing data
poetry run python -m src.cli.main

# Update data
poetry run python -m src.cli.main --update

# View scoring results
poetry run python -m src.cli.main --scores-only

Project Structure

coffee-database/
├── src/                      # Source code
│   ├── cli/                  # Command-line interface
│   ├── models/               # Data models (Bean, Roaster)
│   ├── scrapers/             # Web scrapers for 7 roasters
│   ├── extract/              # NLP extraction pipeline
│   ├── scoring/              # Scoring algorithms
│   └── db/                   # Data persistence
├── tests/                    # Test suite
├── data/                     # JSON data files and reports
├── docs/                     # Documentation
└── scripts/                  # Migration and utility scripts

Key Features

Scrapers

  • Modular design: Each roaster has its own scraper class
  • Base class: Abstract BaseScraper for consistency
  • Multiple strategies: DOM parsing, regex, hybrid approaches
  • Rate limiting: Respectful scraping with delays
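The base-class-plus-rate-limiting design above can be sketched as follows. This is an illustrative stand-in, not the actual `BaseScraper` in `src/scrapers/`; only the two abstract method names (`get_roaster_data`, `extract_bean_info`) come from this README, and the constructor parameters are assumptions.

```python
import abc
import time


class BaseScraper(abc.ABC):
    """Shared contract for per-roaster scrapers, with polite request delays."""

    def __init__(self, base_url: str, delay_seconds: float = 1.0):
        self.base_url = base_url
        self.delay_seconds = delay_seconds
        self._last_request = 0.0

    def _throttle(self) -> None:
        """Sleep so consecutive requests are at least delay_seconds apart."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay_seconds:
            time.sleep(self.delay_seconds - elapsed)
        self._last_request = time.monotonic()

    @abc.abstractmethod
    def get_roaster_data(self) -> dict:
        """Fetch the roaster's raw product data (DOM, regex, or hybrid)."""

    @abc.abstractmethod
    def extract_bean_info(self, raw: dict) -> list[dict]:
        """Normalize raw product data into bean records."""
```

Each concrete roaster scraper then only implements the two hooks, while throttling and shared plumbing live in the base class.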

Data Extraction

  • Smart origin detection: Countries, regions, and farms
  • Flavor analysis: Categorized tasting notes with intensity
  • Processing classification: Natural, washed, anaerobic, etc.
  • Quality indicators: Championships, certifications, specialty scores
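As a flavor of what processing classification and categorized tasting notes involve, here is a minimal keyword-based sketch. The real pipeline in `src/extract/` is NLP-driven and more sophisticated; the category lists and function name below are invented for illustration.

```python
# Invented vocabulary for illustration; the real pipeline uses NLP extraction.
PROCESSING_METHODS = {"natural", "washed", "honey", "anaerobic"}
FLAVOR_CATEGORIES = {
    "fruity": {"berry", "citrus", "stone fruit", "apple"},
    "sweet": {"caramel", "chocolate", "vanilla"},
    "floral": {"jasmine", "rose"},
}


def extract_attributes(description: str) -> dict:
    """Pull processing methods and categorized tasting notes from free text."""
    text = description.lower()
    processing = [m for m in sorted(PROCESSING_METHODS) if m in text]
    flavors = {
        category: sorted(term for term in terms if term in text)
        for category, terms in FLAVOR_CATEGORIES.items()
    }
    # Drop categories with no matches so the output stays compact.
    return {
        "processing": processing,
        "flavors": {c: t for c, t in flavors.items() if t},
    }
```

For example, `extract_attributes("A washed Ethiopian with notes of berry and caramel.")` would classify the processing as washed and bucket the notes under fruity and sweet.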

Architecture

  • Repository pattern: Clean data persistence
  • Pydantic models: Type-safe data validation
  • CLI interface: Easy command-line usage
  • Comprehensive tests: Unit and integration testing
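The Pydantic-model-plus-repository combination might look roughly like this. The field names and the `JsonBeanRepository` class are illustrative, not the project's actual `Bean` schema or repository, and the sketch assumes Pydantic v2.

```python
import json
from pathlib import Path

from pydantic import BaseModel  # Pydantic v2 assumed


class Bean(BaseModel):
    """Type-safe bean record; fields here are illustrative."""
    name: str
    roaster: str
    origin_country: str
    price_eur: float


class JsonBeanRepository:
    """Repository that persists Bean records to a JSON file (cf. data/)."""

    def __init__(self, path: Path):
        self.path = path

    def save_all(self, beans: list[Bean]) -> None:
        self.path.write_text(json.dumps([b.model_dump() for b in beans], indent=2))

    def load_all(self) -> list[Bean]:
        # Pydantic re-validates each record on load.
        return [Bean(**item) for item in json.loads(self.path.read_text())]
```

The CLI and scoring code can then depend on the repository interface rather than on how the JSON files are laid out.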

Testing

# Run all tests
poetry run pytest tests/ -v

# Test specific components
poetry run pytest tests/unit/test_scrapers/ -v
poetry run pytest tests/unit/test_models/ -v

# Integration tests
poetry run pytest tests/integration/ -v
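A unit test in `tests/unit/test_scrapers/` might look like the sketch below; both the `parse_price` helper and the test are made-up stand-ins, not code from this repository.

```python
def parse_price(text: str) -> float:
    """Hypothetical helper normalizing price strings like '€12,50' to floats."""
    return float(text.replace("€", "").replace(",", ".").strip())


def test_parse_price_handles_euro_sign_and_comma():
    assert parse_price("€12,50") == 12.5
    assert parse_price(" 4.50 ") == 4.5
```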

Adding New Roasters

  1. Create a new scraper class extending BaseScraper
  2. Implement get_roaster_data() and extract_bean_info()
  3. Add to the scraper registry
  4. Write tests
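Steps 1-3 above can be sketched as follows. The stub `BaseScraper`, the registry dict, and the `register` decorator are stand-ins for whatever lives in `src/scrapers/`; only the two method names come from this README.

```python
import abc


class BaseScraper(abc.ABC):  # stand-in for the real src.scrapers base class
    @abc.abstractmethod
    def get_roaster_data(self) -> dict: ...

    @abc.abstractmethod
    def extract_bean_info(self, raw: dict) -> list[dict]: ...


SCRAPER_REGISTRY: dict[str, type[BaseScraper]] = {}  # stand-in for the registry


def register(name: str):
    """Step 3: decorator that adds a scraper class to the registry."""
    def wrap(cls):
        SCRAPER_REGISTRY[name] = cls
        return cls
    return wrap


@register("new-roaster")
class NewRoasterScraper(BaseScraper):
    """Steps 1-2: subclass BaseScraper and implement the two hooks."""

    def get_roaster_data(self) -> dict:
        # A real scraper would fetch and parse the roaster's shop pages here.
        return {"products": [{"title": "Ethiopia Guji", "price": "€11.50"}]}

    def extract_bean_info(self, raw: dict) -> list[dict]:
        return [
            {"name": p["title"], "price_eur": float(p["price"].lstrip("€"))}
            for p in raw["products"]
        ]
```

With the class registered, the CLI's update path can discover it by name alongside the existing seven scrapers; step 4 is covered under Testing above.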

Development

The codebase is set up for easy extension:

  • Clean separation of concerns
  • Type hints throughout
  • Comprehensive test coverage
  • Well-documented APIs

Check docs/ for detailed guides on database migration, deployment, and feature roadmaps.

Next Steps

  • Database integration (PostgreSQL/SQLite)
  • Web interface dashboard
  • Additional roaster support
  • Real-time update scheduling
  • Machine learning recommendations

License

MIT License - see LICENSE file for details.


Status: Production ready with 6/7 scrapers working
Last Updated: January 2025
