A comprehensive text analytics and predictive modeling framework for analyzing consumer coffee reviews. This project implements the methodology described in the thesis "Leveraging Text Analytics and Predictive Modeling to Analyze Consumer Coffee Reviews: A Data-Driven Approach" by Marcelo Seijas, Erasmus University Rotterdam.
Phase 2.2 Complete - Ready for scaling or advanced research
Latest Results: XGBoost R²=0.9453, Ridge R²=0.9259
Thesis Compliance: 100% methodology alignment achieved
Quick Start: python validate_15_percent_methodology.py (4-minute validation)
This study investigates the key sensory and non-sensory attributes that drive consumer preferences by analyzing coffee reviews from CoffeeReview.com using a combination of text analytics, sentiment analysis, Multinomial Inverse Regression (MNIR), and machine learning.
- What are the key factors that influence coffee ratings?
- How do text-based features compare to traditional sensory attributes?
- Can advanced NLP techniques improve rating prediction accuracy?
- What insights can topic modeling reveal about coffee review themes?
This implementation follows the exact methodology described in the thesis:
"A diverse set of features, including flavor attributes, categorical variables such as country of origin and roast level, and text-based features derived from BERT embeddings, GloVe vectors, and LDA topics, were used to predict coffee ratings."
- Polars-First Approach: Efficient data processing with lazy evaluation
- Hybrid Compatibility: Seamless conversion to Pandas when needed for sklearn
- Performance Optimization: Leverages Polars' memory efficiency
- TF-IDF Vectorization: 200 features per desc column (600 total) with unigrams, bigrams, and trigrams
- BERT Embeddings: 768-dimensional semantic representations using DistilBERT (2304 total)
- GloVe Embeddings: 300-dimensional pre-trained word vectors (900 total)
- Topic Modeling: LDA and NMF for thematic analysis (30 total topics)
- Sentiment Analysis: DistilBERT-based positive/negative sentiment scoring (6 total)
- LASSO Feature Selection: 279 selected from 3,840 text features (92.7% reduction)
- XGBoost: Best performance (R²=0.9453) with hyperparameter optimization
- Ridge Regression: Excellent performance (R²=0.9259)
- LASSO Regression: Strong performance (R²=0.8897)
- Random Forest: Good ensemble performance (R²=0.8675)
- Linear Regression: Baseline model (R²=0.8173)
- MNIR: Multinomial Inverse Regression for text-sensory correlation analysis
- MLflow Integration: Comprehensive experiment tracking and model registry
- Optuna Optimization: TPE algorithm with intelligent pruning (5-10x speedup)
- SHAP Analysis: Automated feature importance and model interpretation
- Performance Metrics: MAE, RMSE, R² with statistical validation
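The text-feature totals quoted in the list above can be checked with a few lines of arithmetic. This is only a sanity check on the published counts; the dictionary keys are illustrative names, not identifiers from the codebase.

```python
# Text-feature counts as listed above (three description columns).
text_feature_counts = {
    "tfidf": 600,      # 200 per column x 3
    "bert": 2304,      # 768 per column x 3
    "glove": 900,      # 300 per column x 3
    "topics": 30,      # LDA + NMF topics in total
    "sentiment": 6,    # positive/negative score per column
}

total = sum(text_feature_counts.values())
selected = 279  # features kept by LASSO, per the list above
reduction = 1 - selected / total

print(total)               # 3840
print(f"{reduction:.1%}")  # 92.7%
```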
Current Validation Results (15% Sample):
Model           R²       RMSE     MAE      Status
--------------------------------------------------
XGBoost         0.9453   0.4103   0.2152   Best
Ridge           0.9259   0.4775   0.3801   Excellent
LASSO           0.8897   0.5825   0.4623   Strong
Random Forest   0.8675   0.6386   0.3590   Good
Linear          0.8173   0.7497   0.6101   Baseline
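For readers unfamiliar with the three metrics in the table, the sketch below spells out their standard definitions in plain Python. The function name `regression_metrics` and the toy numbers are illustrative assumptions; the project itself may well compute these through scikit-learn instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for two equal-length sequences of numbers."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return r2, rmse, mae


# Toy values for illustration only; not taken from the project's data.
r2, rmse, mae = regression_metrics([90, 92, 94, 96], [91, 92, 93, 96])
print(round(r2, 4), round(rmse, 4), round(mae, 4))
```

R² measures the share of rating variance the model explains, while RMSE and MAE are in the same units as the rating itself; RMSE penalizes large misses more heavily than MAE.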
MNIR Analysis Results:
Sensory Attribute   R²       MSE      Performance
--------------------------------------------------
Acidity             0.9389   0.5343   Excellent
Aftertaste          0.8420   0.0375   Strong
Body                0.7966   0.0508   Good
Aroma               0.7834   0.0816   Good
Flavor              0.5789   0.0433   Moderate
Text features dominate: BERT embeddings and TF-IDF features were found to be the most predictive of coffee ratings, followed by sentiment scores and topic features.
Topic analysis reveals: Distinct themes like origin characteristics, processing methods, flavor profiles, and brewing recommendations.
Sentiment-rating correlation: Strong relationship between sentiment and ratings - positive sentiment correlates with higher ratings (8.5+), while negative sentiment associates with lower ratings (<7.0).
- Python 3.8+
- 8GB+ RAM (for BERT embeddings)
- CUDA-compatible GPU (optional, for faster processing)
- Clone the repository:
git clone <repository-url>
cd coffee-text-analytics
- Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Verify installation:
python run_tests.py # Run test suite

Fastest way to validate thesis methodology (4 minutes):
python validate_15_percent_methodology.py
This gives you complete thesis validation with XGBoost R²=0.9453, all models + MNIR analysis.
- Current Achievement: 100% thesis methodology compliance
- Best Model: XGBoost (R²=0.9453)
- Feature Selection: 279 selected from 3,840 text features (92.7% reduction)
- MNIR Performance: Acidity R²=0.9389, Body R²=0.7966
- Infrastructure: MLflow + Optuna + SHAP analysis ready
python validate_15_percent_methodology.py --sample_size=50 # 50% sample
python validate_15_percent_methodology.py --sample_size=100 # Full dataset
python validate_15_percent_methodology.py --mode=full --trials=50 # Research-grade optimization
mlflow ui --port 5000 # View results in browser

Run the full methodology:
python main.py --steps all

python main.py --steps preprocess
Cleans text columns, extracts country info, standardizes prices.

python main.py --steps features
Runs TF-IDF, BERT, sentiment, and topic extraction.

python main.py --steps train
Trains all models with hyperparameter optimization and SHAP analysis.

python main.py --steps visualize
Creates performance comparisons and feature importance charts.

python main.py --models xgboost random_forest mnir --steps train
python main.py --text_columns desc_1 desc_2 desc_3 --steps features

COFFEE_ENV=production python main.py --steps all # Production settings
COFFEE_ENV=testing python main.py --steps all # Testing settings

The dataset from CoffeeReview.com includes:
- desc_1: Primary review description
- desc_2: Secondary review notes
- desc_3: Additional tasting notes
- rating: Coffee rating score (0-100 scale)
- origin: Coffee origin/country
- roast: Roast level (light, medium, dark)
- roaster: Coffee roasting company
- est_price: Estimated price per pound
- aroma: Aroma score (0-10)
- acid: Acidity score (0-10)
- body: Body/mouthfeel score (0-10)
- flavor: Flavor score (0-10)
- aftertaste: Aftertaste score (0-10)
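The column list above can be expressed as a small typed record with range checks, which is handy when spot-checking rows before feature extraction. This is a minimal sketch, not part of the project: the class name `CoffeeReview`, the `validate` method, and the strict three-value roast set are assumptions (real data may contain intermediate roast labels), and the example row is made up.

```python
from dataclasses import dataclass
from typing import List, Optional

ROAST_LEVELS = {"light", "medium", "dark"}  # assumption: only the three levels listed above
SENSORY_FIELDS = ("aroma", "acid", "body", "flavor", "aftertaste")


@dataclass
class CoffeeReview:
    """One dataset row, using the column names listed above (hypothetical class)."""
    desc_1: str
    desc_2: str
    desc_3: str
    rating: float
    origin: str
    roast: str
    roaster: str
    est_price: Optional[float]  # price per pound; may be missing
    aroma: float
    acid: float
    body: float
    flavor: float
    aftertaste: float

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means every check passed."""
        problems = []
        if not 0 <= self.rating <= 100:
            problems.append("rating outside 0-100")
        if self.roast.lower() not in ROAST_LEVELS:
            problems.append(f"unexpected roast level: {self.roast!r}")
        for name in SENSORY_FIELDS:
            if not 0 <= getattr(self, name) <= 10:
                problems.append(f"{name} outside 0-10")
        return problems


# Made-up row for illustration only.
example = CoffeeReview(
    "Bright and floral.", "Washed process.", "Best as pour-over.",
    93, "Ethiopia", "light", "Example Roasters", 20.0, 9, 8, 8, 9, 8,
)
print(example.validate())  # []
```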
# Example using the new architecture
from src.features.feature_manager import CoffeeFeatureManager

# Initialize feature manager
feature_manager = CoffeeFeatureManager({
    'extractors': ['tfidf', 'sentiment', 'topic']
})

# Extract features
feature_manager.fit(training_texts)
features_df = feature_manager.extract_all_features(
    df=coffee_data,  # Polars DataFrame
    text_columns=['desc_1', 'desc_2', 'desc_3']
)

Per text column:
- TF-IDF Features: 5,000 dimensions
- BERT Embeddings: 768 dimensions
- Sentiment Features: 2 dimensions
- Topic Features: 20 dimensions (10 LDA + 10 NMF)
Total Features per Text Column: ~5,790 dimensions
Total for 3 Text Columns: ~17,370 text-based features
- Polars >=0.15.0: Modern DataFrame library
- Pandas >=1.4.0: Compatibility layer for sklearn
- NumPy >=1.20.0: Numerical computing
- Scikit-learn >=1.0.0: ML algorithms and preprocessing
- XGBoost >=1.5.0: Gradient boosting (best model)
- Transformers >=4.18.0: BERT embeddings
- PyTorch >=1.11.0: Deep learning backend
- Gensim >=4.1.0: Topic modeling
- NLTK >=3.7.0: Text preprocessing
- MLflow >=2.0.0: Experiment tracking
- Optuna >=3.0.0: Hyperparameter optimization
- SHAP >=0.40.0: Model interpretation
- Plotly >=5.0.0: Interactive visualizations
- Matplotlib >=3.5.0: Basic plotting
- Seaborn >=0.11.0: Statistical visualizations
python run_tests.py
Test Coverage:
- 96.7% pass rate - Robust and stable codebase
- Data processing tests - Polars/Pandas integration validated
- Integration tests - End-to-end pipeline validation
- Performance tests - Memory and speed optimization
- Thesis compliance validation - 15% methodology validator proven
- Component-based design - Modular, extensible architecture
- Zero import conflicts - Clean dependency management
- MLflow + Optuna integration - Enhanced experiment tracking
- Professional documentation - Academic-grade documentation
- Thesis methodology compliance - 100% validation achieved
The pipeline always trains new models and overwrites existing models in the models/ directory. Before running on new data:
# Option 1: Clear models directory
rm -rf models/*.pkl
# Option 2: Backup existing models
mkdir models_backup_$(date +%Y%m%d)
cp models/*.pkl models_backup_$(date +%Y%m%d)/
# Then run pipeline
python main.py --steps all
- models/tfidf_vectorizer.pkl: TF-IDF vocabulary
- models/lda_model.pkl: LDA topic model
- models/nmf_model.pkl: NMF topic model
- models/linear_model.pkl: Linear regression
- models/random_forest_model.pkl: Random forest
- models/xgboost_model.pkl: XGBoost
- models/mnir_model.pkl: MNIR model
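A saved artifact from the models/ directory could be reloaded along the following lines. This is a sketch under one assumption: that the `.pkl` files are written with Python's standard `pickle` module (they might instead be written with joblib, in which case `joblib.load` would be the counterpart). The helper name `load_artifact` is hypothetical.

```python
import pickle
from pathlib import Path


def load_artifact(path):
    """Load one pickled artifact, raising a clear error if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"no saved artifact at {path}; run the pipeline first")
    with path.open("rb") as fh:
        return pickle.load(fh)  # only unpickle files from a source you trust
```

For example, `load_artifact("models/xgboost_model.pkl")` would return whatever object the training step stored under that name.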
- CURRENT_STATUS.md: Up-to-date project status and achievements
- STRATEGIC_IMPLEMENTATION_PLAN.md: Master plan with progress
- docs/thesis.md: Complete thesis document
- docs/findings.md: Research findings and insights
- docs/methodology.md: Detailed research methodology
- docs/archive/: Complete development history for reference
The clean_outputs.py utility helps ensure fresh pipeline runs:
# Interactive mode with confirmation
python clean_outputs.py
# Clean all without confirmation
python clean_outputs.py --confirm
# Preview what would be deleted
python clean_outputs.py --dry-run
# Choose specific directories
python clean_outputs.py --selective

We welcome contributions that extend the thesis methodology:
- Additional embedding models (RoBERTa, ELECTRA)
- Advanced topic modeling (BERTopic, Top2Vec)
- Cross-domain validation studies
- Temporal analysis of review trends
- GPU acceleration for BERT processing
- Distributed processing capabilities
- Real-time inference pipeline
- Web interface for exploration
This project is part of an academic thesis and is provided for educational and research purposes.
If you use this code or methodology in your research, please cite:
@mastersthesis{seijas2024coffee,
  title={Leveraging Text Analytics and Predictive Modeling to Analyze Consumer Coffee Reviews: A Data-Driven Approach},
  author={Seijas, Marcelo},
  year={2024},
  school={Erasmus University Rotterdam},
  department={Erasmus School of Economics},
  program={Data Science and Marketing Analytics},
  supervisor={O'Neill, Eoghan},
  secondassessor={Brüggemann, Sean}
}

Thesis Supervisor: Eoghan O'Neill
Second Assessor: Sean Brüggemann
Institution: Erasmus University Rotterdam, Erasmus School of Economics
Program: Data Science and Marketing Analytics
Year: 2024