This project investigates food accessibility, nutritional quality, and socioeconomic disparities across counties in the state of Texas. It demonstrates robust data wrangling techniques by implementing a complete data pipeline across four technical platforms:
- Python (pandas)
- R (tidyverse)
- SQL
- Excel(PowerQuery).
The primary objective is to construct a reproducible data wrangling workflow that transforms messy, real-world datasets into clean, analysis-ready outputs suitable for investigating food access inequities.
- Urban vs Rural Disparities: How does food access differ between urban and rural counties in Texas?
- Poverty-Access Correlation: What is the relationship between poverty rates and limited food access?
- Demographic Disparities: Which demographic groups face the greatest challenges in accessing healthy food?
- Education-Access Relationship: How does educational attainment correlate with food accessibility?
- Nutritional Profiles: What are the nutritional characteristics of different food categories?
- Source: United States Department of Agriculture
- File:
FoodAccessResearchAtlasData.csv - Dimensions: 72,531 rows × 147 columns
- Description: Census-tract-level indicators of supermarket accessibility and food access challenges for different demographic groups
- Key Variables: Population, income, race, SNAP benefits, geographic accessibility measures
- Source: U.S. Census Bureau via CORGIS
- File:
county_demographics.csv - Dimensions: 3,139 rows × 43 columns
- Description: County-level demographic and socioeconomic data from 2010-2019
- Key Variables: Age distribution, education, employment, ethnicity, household income, housing
- Source: USDA Food Composition Database via CORGIS
- File:
food.csv - Dimensions: 7,083 rows × 38 columns
- Description: Nutritional breakdowns of thousands of food items
- Key Variables: Macronutrients, vitamins, minerals, food categories
Food_Access_Nutrition_Equity/
│
├── data/
│ ├── raw/ # Original data files
│ │ ├── FoodAccessResearchAtlasData.csv
│ │ ├── county_demographics.csv
│ │ └── food.csv
│ └── processed/ # Cleaned and merged datasets
│ ├── Texas_County_Merged.csv
│ ├── Texas_Tract_FoodAccess.csv
│ └── Food_Nutrition_Clean.csv
│
├── code/
│ ├── python/ # Python implementation
│ │ └── Food_Access_Nutrition_Equity.ipynb
│ ├── r/ # R implementation
│ │ └── food_access_wrangling.R
│ ├── sql/ # SQL implementation
│ │ └── food_access_queries.sql
│ └── excel/ # Excel documentation
│ └── excel_workflow.md
│
├── outputs/
│ └── visualizations/ # Analysis charts and graphs
│
├── docs/
│ ├── project_report.pdf # Final project report
│ └── data_dictionary.md # Variable descriptions
│
└── README.md # This file
- Load all three datasets
- Examine structure, dimensions, and data types
- Identify initial data quality issues
- Reduce USDA Atlas from 147 to ~20 columns
- Select relevant demographic variables
- Choose key nutritional indicators
- Handle missing values (NULL, -1, NaN)
- Standardize county and state names
- Clean food item descriptions
- Fix data type inconsistencies
- Convert census tracts to county-level aggregations
- Create derived variables (urban/rural flags, accessibility scores)
- Normalize nutritional units
- Categorize foods into groups
- Aggregate tract-level data to county level
- Calculate county-level statistics
- Summarize nutritional profiles by food category
- Join USDA food access with county demographics
- Match on standardized county names
- Handle unmatched records
- Create three analysis-ready datasets
- Document all transformations
- Export to appropriate formats
- Generate research question visualizations
- Create summary statistics
- Validate data integrity
- Environment: Jupyter Notebook
- Key Libraries: pandas, numpy, matplotlib, seaborn
- Approach: DataFrame operations using .loc[], boolean indexing, groupby aggregations
- Output Format: CSV files
- Environment: RStudio
- Key Packages: dplyr, tidyr, ggplot2, readr
- Approach: Pipe-based workflow with verb functions (filter, select, mutate, group_by)
- Output Format: CSV and RDS files
- Platform: SQLite or PostgreSQL
- Approach: CREATE TABLE statements, JOINs, aggregate functions, CTEs
- Output Format: Database tables and views
- Version: Excel 2016 or later
- Techniques: Power Query, VLOOKUP, Pivot Tables, Data Validation
- Output Format: XLSX workbooks with multiple sheets
- Wide Dataset Reduction: Reduced 228 total columns to 30 core columns (86.8% reduction)
- Geographic Mapping: Successfully mapped census tracts to counties for Texas
- Name Standardization: Resolved inconsistencies in county and state naming conventions
- Missing Data Handling: Implemented consistent rules for NULL, -1, and NaN values
- Food Categorization: Created logical groupings for 7,000+ food items
- Unit Harmonization: Documented and standardized nutritional measurements
- Final Datasets: 3 clean, analysis-ready datasets
- Texas Focus: 254 counties analyzed
- Data Quality: 100% of missing values handled appropriately
- Dimension Reduction: From 228 to 30 columns while preserving analytical value
- Visualizations: 5 research question charts generated
pip install pandas numpy matplotlib seaborn jupyterinstall.packages(c("tidyverse", "readr", "haven"))- SQLite 3.x or PostgreSQL 12+
- Microsoft Excel 2016+ with Power Query
-
Clone the repository
git clone [repository-url] cd Food_Access_Nutrition_Equity -
Place raw data files in
data/raw/directory -
Run implementation of choice:
- Python: Open and run
Food_Access_Nutrition_Equity.ipynb - R: Source
food_access_wrangling.R - SQL: Execute
food_access_queries.sql - Excel: Follow
excel_workflow.md
- Python: Open and run
- Weak Correlation: Poverty rates show surprisingly weak correlation (-0.073) with food access limitations in Texas counties
- Urban Advantage: Urban counties consistently show better food accessibility metrics
- Education Impact: Higher educational attainment strongly correlates with improved food access
- Nutritional Diversity: Significant variation in nutritional profiles across food categories
This project was completed as part of a data wrangling course, demonstrating proficiency in:
- Multiple technical platforms
- Real-world data integration
- Reproducible research practices
- Documentation and communication
cchitse(Jason C)
This project is for academic purposes. Data sources retain their original licensing.
- USDA for Food Access Research Atlas data
- CORGIS project for curated datasets
- Course instructors for methodology guidance
- USDA Economic Research Service. (2023). Food Access Research Atlas.
- CORGIS Dataset Project. County Demographics
- CORGIS Dataset Project. Food Nutrition
- Howisonlab tutorials for pandas methodology (howisonlab.github.io)
Last Updated: November 18, 2025