A coffee database system for Rotterdam's specialty coffee roasters with web scraping, NLP extraction, and analytics.
This project scrapes coffee data from local Rotterdam roasters and provides:
- Multi-roaster analytics across the coffee scene
- Sustainability scoring based on transparency and trade practices
- Flavor matching using NLP
- Price tracking and market analysis
- Quality indicators from championships and certifications
- Working scrapers: 7 roasters (Schot, Ripsnorter, Manhattan, Giraffe, Spicekix, Manmetbril, Grounded)
- Data extracted: 72 coffee beans across 14 countries
- Price range: €4.50 - €235.44 (commercial to ultra-premium)
- Architecture: Modular structure with clean separation
# Install dependencies
poetry install
# or
pip install -r requirements.txt
# Show existing data
poetry run python -m src.cli.main
# Update data
poetry run python -m src.cli.main --update
# View scoring results
poetry run python -m src.cli.main --scores-only

coffee-database/
├── src/ # Source code
│ ├── cli/ # Command-line interface
│ ├── models/ # Data models (Bean, Roaster)
│ ├── scrapers/ # Web scrapers for 7 roasters
│ ├── extract/ # NLP extraction pipeline
│ ├── scoring/ # Scoring algorithms
│ └── db/ # Data persistence
├── tests/ # Test suite
├── data/ # JSON data files and reports
├── docs/ # Documentation
└── scripts/ # Migration and utility scripts
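The `scoring/` component combines transparency and trade-practice indicators into a sustainability score. The actual weights and criteria are not documented here; the sketch below only illustrates the general shape of such a scorer, with hypothetical inputs and weights.

```python
# Hypothetical sustainability scorer. The indicators (farm named, price
# disclosed, direct trade, certifications) and all weights are assumptions
# illustrating the approach, not the project's real algorithm.
def sustainability_score(
    farm_named: bool,
    price_paid_disclosed: bool,
    direct_trade: bool,
    certifications: int,
) -> float:
    score = 0.0
    score += 30 if farm_named else 0            # traceability to farm level
    score += 30 if price_paid_disclosed else 0  # price transparency
    score += 25 if direct_trade else 0          # trade practice
    score += min(certifications, 3) * 5         # certification bonus, capped at 15
    return score

print(sustainability_score(True, True, False, 2))  # 70.0
```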
- Modular design: Each roaster has its own scraper class
- Base class: Abstract `BaseScraper` for consistency
- Multiple strategies: DOM parsing, regex, hybrid approaches
- Rate limiting: Respectful scraping with delays
- Smart origin detection: Countries, regions, and farms
- Flavor analysis: Categorized tasting notes with intensity
- Processing classification: Natural, washed, anaerobic, etc.
- Quality indicators: Championships, certifications, specialty scores
- Repository pattern: Clean data persistence
- Pydantic models: Type-safe data validation
- CLI interface: Easy command-line usage
- Comprehensive tests: Unit and integration testing
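The flavor-analysis step groups raw tasting notes into broader categories. A simplified keyword-based stand-in for the `extract/` pipeline might look like this; the category names and keyword lists are assumptions, and the real pipeline's NLP is more involved.

```python
# Simplified keyword-based flavor categorization. Categories and keywords
# are illustrative assumptions, not the project's actual taxonomy.
FLAVOR_CATEGORIES = {
    "fruity": {"berry", "citrus", "apple", "stone fruit", "tropical"},
    "floral": {"jasmine", "rose", "floral", "lavender"},
    "sweet": {"caramel", "honey", "chocolate", "vanilla"},
    "nutty": {"almond", "hazelnut", "peanut"},
}

def categorize_notes(notes: list[str]) -> dict[str, list[str]]:
    """Group raw tasting notes under broad flavor categories."""
    result: dict[str, list[str]] = {}
    for note in notes:
        for category, keywords in FLAVOR_CATEGORIES.items():
            if note.lower() in keywords:
                result.setdefault(category, []).append(note)
    return result

print(categorize_notes(["Citrus", "caramel", "jasmine"]))
# {'fruity': ['Citrus'], 'sweet': ['caramel'], 'floral': ['jasmine']}
```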
# Run all tests
poetry run pytest tests/ -v
# Test specific components
poetry run pytest tests/unit/test_scrapers/ -v
poetry run pytest tests/unit/test_models/ -v
# Integration tests
poetry run pytest tests/integration/ -v

Adding a new roaster:
- Create a new scraper class extending `BaseScraper`
- Implement `get_roaster_data()` and `extract_bean_info()`
- Add to the scraper registry
- Write tests
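The steps above can be sketched as follows. The real `BaseScraper` and registry live in `src/scrapers/`; this stub only mirrors the documented method names (`get_roaster_data()`, `extract_bean_info()`), and the delay value, roaster name, and dict-shaped registry are assumptions.

```python
from abc import ABC, abstractmethod
import time

# Minimal stand-in for the project's BaseScraper. Only the two documented
# abstract methods are mirrored; delay value and registry shape are assumed.
class BaseScraper(ABC):
    delay_seconds = 1.0  # rate limiting between requests (assumed default)

    @abstractmethod
    def get_roaster_data(self) -> dict: ...

    @abstractmethod
    def extract_bean_info(self, html: str) -> list[dict]: ...

    def throttle(self) -> None:
        """Sleep between requests for respectful scraping."""
        time.sleep(self.delay_seconds)

class ExampleRoasterScraper(BaseScraper):
    """Hypothetical scraper for a new roaster."""

    def get_roaster_data(self) -> dict:
        return {"name": "Example Roaster", "city": "Rotterdam"}

    def extract_bean_info(self, html: str) -> list[dict]:
        # Real scrapers parse the DOM or apply regexes here; stubbed out.
        return [{"name": "Example Lot", "price_eur": 12.50}]

# Register the new scraper (assumed dict-based registry)
SCRAPER_REGISTRY = {"example": ExampleRoasterScraper}
```

Because `BaseScraper` is abstract, forgetting to implement either method raises a `TypeError` at instantiation time rather than failing silently mid-scrape.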
The codebase is set up for easy extension:
- Clean separation of concerns
- Type hints throughout
- Comprehensive test coverage
- Well-documented APIs
Check docs/ for detailed guides on database migration, deployment, and feature roadmaps.
- Database integration (PostgreSQL/SQLite)
- Web interface dashboard
- Additional roaster support
- Real-time update scheduling
- Machine learning recommendations
MIT License - see LICENSE file for details.
Status: Production ready with 6/7 scrapers working
Last Updated: January 2025