Kaggle Data Cleaning Challenge

Overview

This repository contains Jupyter notebooks for the Kaggle Data Cleaning Challenge, a hands-on project focused on improving data quality through various preprocessing techniques. The challenge consists of six key tasks:

Handling Missing Values – Identifying and addressing missing data issues.
Access Notebook or Kaggle
Parsing Dates – Standardizing date formats for consistency.
Access Notebook or Kaggle
Scaling and Normalizing Data – Preparing numerical data for analysis by applying appropriate transformations.
Access Notebook or Kaggle
Handling Character Encodings – Resolving text encoding issues to ensure data integrity.
Access Notebook on Kaggle
Fixing Inconsistent Data Entry – Standardizing categorical values to remove inconsistencies.
Access Notebook or Kaggle
Full Data Cleaning Workflow – Integrating all previous cleaning steps into a complete, reproducible workflow.
Access Notebook
(This notebook also provides the dataset for the final task, nb6)

These notebooks demonstrate practical data cleaning techniques using Python and pandas within a structured Jupyter Lab environment.
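
As a rough illustration of the first three tasks, the following is a minimal sketch that assumes a small made-up DataFrame. The column names, the median fill, and the min-max scaling are illustrative choices, not necessarily what the notebooks do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from the challenge datasets.
df = pd.DataFrame({
    "date": ["01/02/2017", "03/15/2017", None],
    "amount": [10.0, np.nan, 30.0],
})

# 1. Missing values: count them per column, then fill numeric gaps with the median.
missing_per_column = df.isnull().sum()
df["amount"] = df["amount"].fillna(df["amount"].median())

# 2. Parsing dates: convert strings to datetime using an explicit format.
#    Missing entries become NaT.
df["date_parsed"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# 3. Scaling: min-max scale the numeric column to the [0, 1] range.
amount = df["amount"]
df["amount_scaled"] = (amount - amount.min()) / (amount.max() - amount.min())
```

Filling with the median is only one option; dropping rows or leaving the gaps in place can be more appropriate depending on why the values are missing.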

Highlight: Data Cleaning in nb6 - Fixing Inconsistent Data Entry

In the nb6-data-cleaning-challenge-full-cleaning.ipynb notebook, I demonstrate my ability to clean data by addressing issues of inconsistent data entry, which is a common challenge in real-world datasets. Specifically, I focus on:

Identifying Inconsistencies: I analyse the dataset to find discrepancies such as different formats for categorical variables or variations in text encoding.
Standardising Categories: I ensure consistency by converting categorical data to a uniform format. For example, I may merge different spellings of the same value or standardise abbreviations.
Handling Invalid Entries: Any rows with erroneous data (such as invalid or impossible values) are removed or replaced with appropriate substitutes.
Creating Cleaned Datasets: After addressing inconsistencies, I generate a cleaned dataset that is ready for further analysis or modelling.

This task demonstrates my ability to recognise and resolve issues with inconsistent data, ensuring that the dataset is reliable and ready for analysis. The notebook showcases my proficiency in using pandas for data manipulation and cleaning techniques.
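
The four steps listed above could look roughly like this in pandas. This is a sketch on invented data, not the notebook's actual code: the columns, the mapping of variants, and the plausible age range are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data; the column names and values are illustrative only.
df = pd.DataFrame({
    "country": [" Germany", "germany ", "GERMANY", "S. Korea", "south korea", "??"],
    "age": [34, 29, -5, 41, 230, 38],
})

# Identify inconsistencies: list the distinct raw values.
raw_values = sorted(df["country"].unique())

# Standardise categories: trim whitespace, lowercase, then map known variants.
df["country"] = df["country"].str.strip().str.lower()
df["country"] = df["country"].replace({"s. korea": "south korea"})

# Handle invalid entries: keep only known categories and plausible ages.
valid_countries = {"germany", "south korea"}
mask = df["country"].isin(valid_countries) & df["age"].between(0, 120)

# Create the cleaned dataset.
cleaned = df[mask].reset_index(drop=True)

# Saving is left commented out; the file name is a placeholder.
# cleaned.to_csv("data/d2-clean/example.csv", index=False)
```

Here invalid rows are dropped; replacing them with substitutes, as mentioned above, would be an equally valid choice in some datasets.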

Project Structure

Kaggle-Data_Cleaning_Challenge/
│── notebooks/
│   │── nb1-data-cleaning-challenge-handling-missing-values.ipynb
│   │── nb2-data-cleaning-challenge-scale-and-normalize-data.ipynb
│   │── nb3-data-cleaning-challenge-parsing-dates.ipynb
│   │── nb4-data-cleaning-challenge-character-encodings.ipynb
│   │── nb5-data-cleaning-challenge-inconsistent-data-entry.ipynb
│   │── nb6-data-cleaning-challenge-full-cleaning.ipynb
│── data/      # Raw, interim, and cleaned datasets
│   │── d0-raw/
│   │── d1-interim/
│   │── d2-clean/
│── kaggle_cleaning/      # Scripts and utilities
│   │── config.py
│   │── data.py
│   │── utils.py
│   │── __init__.py
│── scripts/archive/      # Archived scripts
│── LICENSE               # License file
│── README.md             # Project documentation
│── setup.py              # Project setup script
│── .gitignore            # Files and directories to ignore in version control

Setup Instructions

1. Clone the Repository

git clone https://github.com/jgp-13/kaggle-data-cleaning-challenge
cd kaggle-data-cleaning-challenge

2. Set Up the Environment

If using Conda, create and activate the environment:

conda env create -f environment.yml
conda activate kaggle-cleaning

Otherwise, install dependencies manually using pip:

pip install -r requirements.txt

3. Download the Datasets

Download the datasets used in this challenge from Kaggle.
Place the raw datasets in the data/d0-raw/ directory.
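
To confirm the files landed in the right place, a small helper along these lines may be useful. It is a sketch using only the standard library; the function names are hypothetical and not part of the kaggle_cleaning package. Because downloaded files are not always UTF-8 (the subject of the character-encodings notebook), it also tries a fallback encoding, and the choice of cp1252 is an assumption.

```python
from pathlib import Path

RAW_DIR = Path("data/d0-raw")  # directory named in the project structure


def list_raw_files(raw_dir=RAW_DIR):
    """Return the sorted file names found in the raw data directory."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        return []
    return sorted(p.name for p in raw_dir.iterdir() if p.is_file())


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

The encoding returned by read_text_with_fallback can then be passed to pandas, for example pd.read_csv(path, encoding=encoding).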

4. Launch Jupyter Lab

jupyter lab

Dependencies

The required dependencies can be installed via pip or Conda. They include:

pandas
numpy
matplotlib
seaborn
jupyterlab

Ensure your environment is correctly set up to run the notebooks.
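
One quick way to check the environment before opening the notebooks is to test whether each package can be found. The snippet below is a minimal sketch; the helper name is hypothetical, and it only checks that the packages are installed, not which versions.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "jupyterlab"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print(f"Missing: {', '.join(missing)}")
    else:
        print("All dependencies found.")
```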

Usage

Open each notebook under the notebooks/ directory.
Follow the markdown instructions and run the provided code cells.
Modify or experiment with the code to further understand data cleaning techniques.

Contributing

Feel free to fork this repository, make improvements, and submit a pull request.

License

This project is open-source and available under the MIT License.

For questions or feedback, reach out via GitHub Issues.
