Turning raw data into insight, one notebook at a time.
Welcome to my personal Data Science learning and project hub. This repository documents my journey through data analysis, machine learning, statistical modeling, and real-world problem-solving with data.
Whether you're here to learn, collaborate, or explore — make yourself at home.
Data-Science/
│
├── 📊 EDA/ # Exploratory Data Analysis notebooks
│ ├── titanic_eda.ipynb
│ ├── world_happiness_eda.ipynb
│ └── retail_sales_eda.ipynb
│
├── 🤖 Machine-Learning/ # Supervised & unsupervised ML projects
│ ├── house_price_prediction/
│ ├── customer_churn/
│ └── spam_classifier/
│
├── 📈 Visualization/ # Charts, dashboards, and storytelling
│ ├── matplotlib_showcase.ipynb
│ └── plotly_interactive.ipynb
│
├── 🧹 Data-Cleaning/ # Messy data → clean data pipelines
│ └── cleaning_pipeline.ipynb
│
├── 📝 Notes/ # Study notes & reference sheets
│ ├── statistics_101.md
│ ├── pandas_cheatsheet.md
│ └── sklearn_reference.md
│
└── README.md
⚠️ This structure is a roadmap — projects are added progressively.
Goal: Predict housing prices using regression models.
Tools: pandas, scikit-learn, XGBoost, matplotlib, seaborn
Highlights:
- Feature engineering on 80+ columns
- Compared Linear Regression, Ridge, Lasso, and XGBoost
- Final RMSE: ~18,000 (top 15% Kaggle score)
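The model comparison above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: it uses synthetic data from `make_regression` in place of the housing dataset, the `alpha` values are placeholder assumptions, and XGBoost is left out to keep the example dependency-light. The helper name `cv_rmse` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (NOT the real dataset).
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Placeholder regularization strengths; tune these on real data.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}


def cv_rmse(model, X, y, folds=5):
    """Return the mean RMSE across cross-validation folds."""
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    # scikit-learn reports the negated RMSE so that higher is better.
    return float(-scores.mean())


results = {name: cv_rmse(model, X, y) for name, model in models.items()}

# Print models from lowest to highest error.
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```

Scoring every model with the same cross-validation folds keeps the comparison fair; a single train/test split can favor one model by chance.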
Goal: Identify customers likely to cancel their subscription.
Tools: pandas, scikit-learn, imbalanced-learn, SHAP
Highlights:
- Handled severe class imbalance with SMOTE
- Random Forest + SHAP for explainability
- Precision: 87% | Recall: 82%
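For readers new to these metrics, here is a small sketch of how precision and recall are computed from predictions. The labels below are toy values chosen for illustration, not the churn model's output, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```

With imbalanced classes, accuracy alone can look high even when most churners are missed, which is why both numbers are reported.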
Goal: Deep dive into factors driving happiness across nations.
Tools: pandas, plotly, seaborn, statsmodels
Highlights:
- Correlation heatmaps and regression analysis
- Interactive choropleth world map
- Insight: GDP per capita explains ~63% of happiness variance
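A note on what "explains X% of variance" means: in a simple linear regression with one predictor, R² equals the squared Pearson correlation between the predictor and the target. The sketch below assumes that single-predictor setting; the numbers are made up for illustration and are not taken from the World Happiness data, and `r_squared_simple` is a hypothetical helper name.

```python
def r_squared_simple(x, y):
    """R² of a one-predictor linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Made-up values standing in for GDP per capita and happiness score.
gdp = [0.5, 0.9, 1.1, 1.4, 1.6]
happiness = [4.2, 5.1, 5.0, 6.3, 6.9]

r2 = r_squared_simple(gdp, happiness)
print(f"R² = {r2:.2f}")
```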
Goal: Binary classification of emails as spam or not spam.
Tools: scikit-learn (TF-IDF vectorization, Naive Bayes), NLTK
Highlights:
- Full NLP pipeline: tokenization → vectorization → classification
- Accuracy: 98.4% on test set
- False positive rate kept below 1%
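The tokenization → vectorization → classification flow can be sketched with a scikit-learn pipeline. This is an assumption-laden illustration: it relies on `TfidfVectorizer`'s built-in tokenizer instead of NLTK, the six example emails are invented, and `false_positive_rate` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 1 = spam, 0 = not spam.
emails = [
    "win a free prize now",
    "claim your free money today",
    "cheap pills limited offer",
    "meeting agenda for monday",
    "lunch tomorrow with the team",
    "project report attached for review",
]
labels = [1, 1, 1, 0, 0, 0]

# TfidfVectorizer tokenizes and weights terms; MultinomialNB classifies.
spam_clf = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
spam_clf.fit(emails, labels)

new_emails = ["free prize waiting, claim now", "notes from the monday meeting"]
predictions = spam_clf.predict(new_emails)
print(list(predictions))


def false_positive_rate(y_true, y_pred):
    """Share of legitimate emails (label 0) wrongly flagged as spam."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0
```

Tracking the false positive rate separately from accuracy matters for spam filters, since flagging a legitimate email is usually costlier than letting one spam message through.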
| Category | Tools |
|---|---|
| Languages | Python 3.x |
| Data Manipulation | pandas, NumPy |
| Visualization | matplotlib, seaborn, Plotly |
| Machine Learning | scikit-learn, XGBoost, LightGBM |
| NLP | NLTK, spaCy, scikit-learn (TF-IDF) |
| Statistics | statsmodels, SciPy |
| Notebooks | Jupyter, Google Colab |
| Version Control | Git, GitHub |
Here's the roadmap I'm following to level up:
- Python fundamentals & OOP
- NumPy & pandas for data manipulation
- Data visualization with matplotlib & seaborn
- Exploratory Data Analysis (EDA)
- Statistics: distributions, hypothesis testing, regression
- Supervised learning (regression + classification)
- Unsupervised learning (clustering, dimensionality reduction)
- Model evaluation, tuning, and deployment
- Deep Learning with TensorFlow / PyTorch
- MLOps & model monitoring
Clone the repo and install dependencies:
git clone https://github.com/CreepyLewis/Data-Science.git
cd Data-Science
pip install -r requirements.txt
Open any notebook with Jupyter:
jupyter notebook
Or open directly in Google Colab by clicking the badge at the top of each notebook.
numpy
pandas
matplotlib
seaborn
plotly
scikit-learn
xgboost
lightgbm
nltk
spacy
statsmodels
scipy
imbalanced-learn
shap
jupyter
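After installing, a quick check like the one below can confirm the packages are importable. Note that a few install names differ from their import names (`scikit-learn` imports as `sklearn`, `imbalanced-learn` as `imblearn`). This is a convenience sketch, not a script shipped with the repo; `missing_packages` is a hypothetical helper, and `jupyter` is omitted because it is launched from the command line.

```python
from importlib.util import find_spec

# Install name (as in requirements.txt) -> top-level import name.
IMPORT_NAMES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "plotly": "plotly",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "nltk": "nltk",
    "spacy": "spacy",
    "statsmodels": "statsmodels",
    "scipy": "scipy",
    "imbalanced-learn": "imblearn",
    "shap": "shap",
}


def missing_packages(mapping):
    """Return install names whose import module cannot be found."""
    return sorted(
        name for name, module in mapping.items() if find_spec(module) is None
    )


missing = missing_packages(IMPORT_NAMES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```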
This is a personal portfolio repo, but PRs and issues are very welcome!
- 🐛 Found a bug in a notebook? Open an issue.
- 💡 Have a dataset or project idea? Drop it in Discussions.
- 🌟 Liked the work? Give it a star — it helps a lot!
| Platform | Link |
|---|---|
| GitHub | @CreepyLewis |
| coming soon | |
| coming soon | |
This project is licensed under the MIT License — feel free to use, remix, and build on it.
Made with 💻, ☕, and a lot of .head() calls.
⭐ Star this repo if you find it useful! ⭐