An AI-enhanced application for automated data preparation, estimation and report writing for statistical survey data.
This application streamlines the survey data processing workflow by providing:
- Automated Data Cleaning: Missing value imputation, outlier detection, rule-based validation
- Weight Application: Design weights and weighted/unweighted summaries with smart detection and validation
- Advanced Weighting Dashboard: Interactive impact analysis showing before/after comparisons
- Report Generation: Standardized PDF/HTML reports with visualizations
- 🤖 AI-Powered Reporting: Comprehensive analysis reports using Google Gemini AI
- User-Friendly Interface: Streamlit-based web interface with guided workflow
```
statsathon/
├── src/
│   ├── data_processing/          # Data loading, cleaning, and validation
│   │   ├── data_loader.py        # CSV/Excel file handling
│   │   ├── data_cleaner.py       # Missing values, outliers, rules
│   │   └── schema_mapper.py      # Column mapping and types
│   ├── analysis/                 # Statistical analysis and weights
│   │   ├── weight_calculator.py  # Weighted statistics
│   │   └── estimator.py          # Population estimation
│   ├── reporting/                # Report generation
│   │   ├── report_generator.py   # PDF/HTML reports
│   │   └── templates/            # Report templates
│   └── utils/                    # Common utilities
│       ├── config.py             # Configuration
│       └── validators.py         # Data validation
├── frontend/                     # Streamlit interface
│   └── streamlit_app.py          # Main web application
├── tests/                        # Unit tests
├── data/                         # Sample datasets
│   ├── sample_survey_data.csv
│   └── create_sample_data.py
├── requirements.txt              # Python dependencies
├── run_app.py                    # Startup script
└── README.md                     # This file
```
- Python 3.8 or higher
- pip package manager

1. Clone or download the project

   ```shell
   cd statsathon
   ```

2. Install dependencies

   ```shell
   pip install -r requirements.txt
   ```

3. Run the application

   ```shell
   python run_app.py
   ```

   Or directly with Streamlit:

   ```shell
   streamlit run frontend/streamlit_app.py
   ```

4. Open your browser
   - Navigate to http://localhost:8501
   - The application will start automatically
- File Upload: Drag-and-drop CSV/Excel files (up to 100MB)
- Automatic Schema Detection: Identifies column types and roles
- Data Preview: Interactive data exploration
- Validation: Comprehensive data quality checks
- Hybrid Approach: Choose between Guided Mode and Manual Workshop
- Guided Mode (🤖): AI-powered automatic suggestions for quick setup
- Intelligent analysis of data quality issues
- One-click application of recommended cleaning rules
- Perfect for rapid prototyping and standard workflows
- Manual Workshop (🔧): Full control for data scientists and power users
- Column-by-column manual configuration
- Chainable operations and custom pipelines
- Advanced parameter tuning for each method
- Real-time preview of effects
- Missing Value Imputation:
- Mean, median, mode imputation
- KNN imputation for complex patterns
- Forward/backward fill methods
- Outlier Detection:
- IQR method (Interquartile Range) with customizable multipliers
- Z-score and Modified Z-score with adjustable thresholds
- Isolation Forest for multivariate outliers
- Percentile-based detection
- Outlier Treatment:
- Flag outliers for review
- Remove outliers from dataset
- Winsorization with custom percentiles
- Capping at specified bounds
- Data Validation:
- Custom business rules
- Range validations
- Consistency checks
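To make the cleaning methods concrete, here is a minimal pandas sketch of median imputation, IQR-based outlier detection, and winsorization. The column name `income` is hypothetical, and the app's actual implementation in `data_cleaner.py` may differ:

```python
import numpy as np
import pandas as pd

def impute_median(series: pd.Series) -> pd.Series:
    """Fill missing values with the column median."""
    return series.fillna(series.median())

def iqr_outlier_mask(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)

def winsorize(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Cap values at the given lower/upper percentiles."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lo, hi)

df = pd.DataFrame({"income": [30.0, 32.0, 31.0, np.nan, 29.0, 500.0]})
df["income"] = impute_median(df["income"])
print(int(iqr_outlier_mask(df["income"]).sum()))  # → 1 (the value 500 is flagged)
```

Raising the IQR multiplier (e.g. 3.0) makes detection more conservative; winsorization keeps the row but pulls the extreme value back to the chosen percentile.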
- Survey Weight Support: Handles complex survey designs
- Weighted Statistics: Mean, total, proportion calculations
- Design Effect: Measures impact of survey design
- Effective Sample Size: Accounts for weighting effects
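The weighted statistics above follow standard formulas; a minimal NumPy sketch (the example values are illustrative, not the app's code):

```python
import numpy as np

def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

def weighted_total(x, w):
    return np.sum(w * x)

def weighted_proportion(mask, w):
    """Share of the weighted population where mask is True."""
    return np.sum(w * mask) / np.sum(w)

x = np.array([4.0, 2.0, 5.0])
w = np.array([2.0, 1.0, 1.0])          # hypothetical design weights
print(weighted_mean(x, w))              # → 3.75  ((8 + 2 + 5) / 4)
print(weighted_total(x, w))             # → 15.0
print(weighted_proportion(x >= 4, w))   # → 0.75  ((2 + 1) / 4)
```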
- Parameter Estimation: Population totals, means, proportions
- Confidence Intervals: Multiple confidence levels (90%, 95%, 99%)
- Variance Estimation:
- Simple random sampling
- Taylor linearization
- Jackknife resampling
- Bootstrap methods
- Stratified Analysis: Support for stratified samples
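As one example of the resampling-based variance methods, here is a percentile-bootstrap confidence interval for a weighted mean. This is a sketch assuming simple with-replacement resampling of respondents; the app's estimator may handle strata and clusters differently:

```python
import numpy as np

def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

def bootstrap_ci(x, w, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for a weighted statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample respondents with replacement
        reps[b] = stat(x[idx], w[idx])
    alpha = (1 - level) / 2
    return np.quantile(reps, [alpha, 1 - alpha])

x = np.array([3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 3.0, 4.0])
w = np.array([1.0, 2.0, 1.5, 1.0, 0.5, 1.0, 2.0, 1.0])
lo, hi = bootstrap_ci(x, w, weighted_mean)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```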
- Automated Reports: PDF and HTML formats
- Interactive Visualizations: Charts and graphs
- Comprehensive Summaries: All analysis results
- Customizable Templates: Flexible report layouts
- Google Gemini Integration: Advanced LLM-powered report generation
- Comprehensive Analysis: Every detail analyzed and explained
- Multiple Report Depths: From executive summaries to academic research level
- Smart Insights: AI identifies patterns, correlations, and actionable insights
- Professional Formatting: Publication-ready reports in multiple formats
- Interactive Refinement: Edit and refine reports with AI assistance
- Multi-Format Export: Markdown, HTML, and text formats
- Customizable Focus: Tailored analysis for different survey types
- Smart Weight Detection: AI-powered automatic weight column identification
- Comprehensive Validation: Real-time weight quality assessment
- Impact Dashboard: Visual before/after comparisons showing weighting effects
- Distribution Analysis: Interactive charts showing how weights reshape data
- Expert Controls: Weight trimming, post-stratification, and calibration tools
- Quality Metrics: Design effect, effective sample size, and efficiency calculations
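The quality metrics listed above have standard definitions: the Kish effective sample size is (Σw)² / Σw², and the design effect due to unequal weighting is n / n_eff. A sketch of those checks plus simple percentile weight trimming (function names are illustrative, not the app's API):

```python
import numpy as np

def validate_weights(w):
    """Basic quality checks before applying survey weights."""
    w = np.asarray(w, dtype=float)
    issues = []
    if np.any(~np.isfinite(w)):
        issues.append("non-finite weights")
    if np.any(w <= 0):
        issues.append("non-positive weights")
    return issues

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

def design_effect(w):
    """Kish design effect from unequal weighting: n / n_eff."""
    return len(w) / effective_sample_size(w)

def trim_weights(w, upper_pct=99):
    """Cap extreme weights at a percentile to reduce variance."""
    cap = np.percentile(w, upper_pct)
    return np.minimum(w, cap)

w = np.array([1.0, 1.0, 1.0, 10.0])
print(validate_weights(w))                    # → []
print(round(effective_sample_size(w), 2))     # → 1.64
print(round(design_effect(w), 2))             # → 2.44
```

One extreme weight drops the effective sample size from 4 to about 1.6, which is exactly the kind of effect the impact dashboard is meant to surface.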
- Navigate to "Data Upload" page
- Upload your CSV or Excel file
- Review file information and data preview
- Check validation results
- Go to "Schema Configuration"
- Click "Analyze Schema" to auto-detect column types
- Review and modify column classifications:
- Numeric: Continuous variables
- Categorical: Discrete categories
- Weight: Survey weights
- ID: Identifier columns
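Auto-detection of column roles can be approximated with simple heuristics; a sketch (the rules and thresholds are illustrative, not the logic in `schema_mapper.py`):

```python
import pandas as pd

def detect_column_role(series: pd.Series) -> str:
    """Heuristic column classification (a simplified sketch)."""
    name = (series.name or "").lower()
    if name in {"id", "respondent_id"} or (series.is_unique and series.dtype == object):
        return "id"
    if "weight" in name:
        return "weight"
    if pd.api.types.is_numeric_dtype(series):
        # Low-cardinality numerics often encode categories
        if series.nunique() <= 10 and series.nunique() < len(series):
            return "categorical"
        return "numeric"
    return "categorical"

df = pd.DataFrame({
    "respondent_id": ["a1", "a2", "a3"],
    "age": [25, 40, 61],
    "design_weight": [1.2, 0.8, 1.0],
    "region": ["N", "S", "N"],
})
print({c: detect_column_role(df[c]) for c in df.columns})
```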
- Navigate to "Data Cleaning"
- Choose Your Approach:
Option A: Guided Mode (🤖) - Recommended for beginners
- Click "Suggest Cleaning Rules" for automatic recommendations
- Review AI-generated suggestions with explanations
- Select which rules to apply with checkboxes
- Apply selected cleaning operations with one click
Option B: Manual Workshop (🔧) - For data scientists & power users
- Step 1: Select a column to work with
- Step 2: Choose operation type:
- Handle Missing Values (mean, median, KNN, etc.)
- Detect & Handle Outliers (IQR, Z-score, etc.)
- Transform Data (coming soon)
- Validate Data (range checks, custom conditions)
- Custom Rules (advanced)
- Step 3: Configure method-specific parameters:
- IQR multiplier (e.g., 1.5, 2.0, 3.0)
- Z-score threshold (e.g., 2.0, 3.0, 4.0)
- KNN neighbors count
- Outlier treatment action (flag, remove, winsorize, cap)
- Step 4: Add to pipeline and repeat for other columns
- Step 5: Execute entire pipeline or modify as needed
- Compare Methods: In Manual Workshop, you can:
- Try IQR vs Z-score on the same column
- Compare different imputation methods
- Build complex multi-step pipelines
- Save and reuse cleaning workflows
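The chainable-pipeline idea behind the Manual Workshop can be sketched in a few lines; this is a minimal illustration (class and column names are hypothetical), not the app's actual pipeline code:

```python
import numpy as np
import pandas as pd

class CleaningPipeline:
    """Minimal chainable pipeline: queue steps, then execute them in order."""
    def __init__(self):
        self.steps = []

    def add(self, column, func, label):
        self.steps.append((column, func, label))
        return self   # returning self enables chaining

    def run(self, df):
        df = df.copy()
        for column, func, label in self.steps:
            df[column] = func(df[column])
        return df

pipe = (CleaningPipeline()
        .add("age", lambda s: s.fillna(s.median()), "impute median")
        .add("age", lambda s: s.clip(upper=s.quantile(0.99)), "cap at p99"))

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 200.0]})
cleaned = pipe.run(df)
print(cleaned["age"].tolist())
```

Because the queued steps are just data, a pipeline like this can be inspected, reordered, saved, and reused, which is what makes the workflow reproducible.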
- Go to "Weight Analysis"
- Select your weight column
- Choose variables to analyze
- Select statistics to calculate (count, mean, std, etc.)
- Run the analysis to see weighted vs unweighted results
- Navigate to "Population Estimation"
- Configure estimation parameters:
- Select weight column
- Choose variables for estimation
- Select estimation types (mean, total, proportion)
- Choose variance estimation method
- Run estimation to get confidence intervals
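The confidence intervals produced by this step follow the usual normal approximation, estimate ± z·SE. A sketch using a simple Taylor/SRS-style variance for a weighted mean (real complex-design variance estimation differs; the formula here ignores strata and clusters):

```python
import numpy as np
from scipy import stats

def weighted_mean_ci(x, w, level=0.95):
    """Normal-approximation CI with a simple SRS-style variance:
    Var(mean_hat) ~= sum(w_i^2 * (x_i - mean_hat)^2) / (sum w)^2
    """
    x, w = np.asarray(x, float), np.asarray(w, float)
    mean_hat = np.sum(w * x) / np.sum(w)
    var_hat = np.sum(w ** 2 * (x - mean_hat) ** 2) / np.sum(w) ** 2
    z = stats.norm.ppf(0.5 + level / 2)   # 1.96 for a 95% interval
    half = z * np.sqrt(var_hat)
    return mean_hat - half, mean_hat + half

x = np.array([3.0, 4.0, 5.0, 4.0, 6.0])
w = np.array([1.0, 2.0, 1.0, 1.0, 1.0])
lo, hi = weighted_mean_ci(x, w)
print(f"[{lo:.2f}, {hi:.2f}]")
```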
- Go to "Report Generation"
- Configure report settings:
- Enter report title
- Choose format (PDF or HTML)
- Select sections to include
- Generate and download your report
- Navigate to "🤖 AI Report (Gemini)"
- Configure Gemini AI settings:
- Enter your Google Gemini API key
- Select model (gemini-1.5-pro recommended)
- Configure report parameters:
- Set report title and analysis focus
- Choose depth level (Executive to Academic Research)
- Select target audience and language style
- Choose sections to include
- Click "Generate Comprehensive AI Report"
- Review, edit, and download in multiple formats
AI Report Features:
- Automatic Analysis: AI analyzes every aspect of your data
- Professional Insights: Identifies patterns and correlations
- Actionable Recommendations: Specific next steps and improvements
- Multiple Formats: Download as Markdown, HTML, or text
- Interactive Refinement: Ask AI to focus on specific areas
- Publication Ready: Professional formatting for reports and presentations
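A rough sketch of how such a report request could be assembled. The prompt fields are illustrative, and the `google-generativeai` usage (shown commented out, since it needs an API key) is an assumption based on this README, not the app's confirmed implementation:

```python
def build_report_prompt(summary: dict, depth: str = "Executive",
                        focus: str = "overall quality") -> str:
    """Assemble an analysis prompt from a dataset summary (illustrative)."""
    lines = [
        f"Write a survey analysis report at {depth} depth, focused on {focus}.",
        "Dataset summary:",
    ]
    for key, value in summary.items():
        lines.append(f"- {key}: {value}")
    lines.append("Identify patterns, correlations, and actionable recommendations.")
    return "\n".join(lines)

prompt = build_report_prompt(
    {"rows": 1000, "weighted mean satisfaction": 4.05, "missing rate": "2.3%"}
)

# Actual call (requires a Gemini API key; uncomment to run):
# import google.generativeai as genai
# genai.configure(api_key="YOUR_API_KEY")
# model = genai.GenerativeModel("gemini-1.5-pro")
# report = model.generate_content(prompt).text
print(prompt.splitlines()[0])
```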
The application includes sample datasets for testing:
- sample_survey_data.csv: Realistic survey data with demographics, satisfaction scores, and survey weights
- problematic_test_data.csv: Dataset with various data quality issues for testing cleaning features
To generate new sample data:

```shell
cd data
python create_sample_data.py
```

The application uses modular configuration in `src/utils/config.py`. Key settings include:
- File Limits: Maximum file size (100MB), supported formats
- Cleaning Methods: Available imputation and outlier detection methods
- Statistical Parameters: Confidence levels, variance methods
- Report Options: Output formats, template settings
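A hypothetical sketch of what such a settings module might look like; the keys and values below are illustrative, not the actual contents of `src/utils/config.py`:

```python
# Illustrative configuration shape (names are assumptions, not the app's real keys)
CONFIG = {
    "file_limits": {"max_size_mb": 100, "formats": [".csv", ".xlsx", ".xls"]},
    "cleaning": {
        "imputation_methods": ["mean", "median", "mode", "knn", "ffill", "bfill"],
        "outlier_methods": ["iqr", "zscore", "modified_zscore", "isolation_forest"],
    },
    "statistics": {
        "confidence_levels": [0.90, 0.95, 0.99],
        "variance_methods": ["srs", "taylor", "jackknife", "bootstrap"],
    },
    "reports": {"formats": ["pdf", "html"]},
}
```

Centralizing limits and method lists this way lets the UI, validators, and estimators all read from one place.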
- Streamlit: Web interface framework
- Pandas: Data manipulation and analysis
- NumPy: Numerical computations
- SciPy: Statistical functions
- Scikit-learn: Machine learning algorithms
- Plotly: Interactive visualizations
- ReportLab: PDF generation
- Jinja2: HTML templating
- Modular Design: Separate modules for data processing, analysis, and reporting
- Configuration Management: Centralized settings and parameters
- Error Handling: Comprehensive validation and error reporting
- Memory Efficient: Optimized for large datasets
Run tests using pytest:

```shell
pytest tests/
```

- Unit tests for all core modules
- Integration tests for workflow
- Sample data validation
- Error handling verification
- File Size: Handles files up to 100MB
- Rows: Tested with datasets up to 1M rows
- Memory: Efficient memory usage with chunked processing
- Speed: Optimized algorithms for large datasets
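Chunked processing for large files follows the standard pandas pattern: accumulate running sums over `read_csv` chunks instead of loading everything at once. A minimal sketch (column names are hypothetical):

```python
import io
import pandas as pd

def chunked_weighted_mean(source, value_col, weight_col, chunksize=100_000):
    """Accumulate sum(w*x) and sum(w) chunk by chunk to bound memory use."""
    wx_total = w_total = 0.0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        wx_total += (chunk[value_col] * chunk[weight_col]).sum()
        w_total += chunk[weight_col].sum()
    return wx_total / w_total   # weighted mean over the full file

# Small in-memory demo; in practice `source` would be a file path
csv_data = "satisfaction,weight\n4,1\n2,1\n5,2\n"
print(chunked_weighted_mean(io.StringIO(csv_data), "satisfaction", "weight",
                            chunksize=2))  # → 4.0
```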
- Very large files (>100MB) may require memory optimization
- Complex survey designs may need custom variance estimation
- PDF reports limited to essential tables for performance
This is an MVP version. Future enhancements may include:
- Advanced Survey Design: Complex stratification and clustering
- Dashboard Interface: Real-time analytics dashboard
- API Integration: REST API for programmatic access
- Cloud Deployment: Docker containers and cloud hosting
- Audit Trail: Complete workflow logging and versioning
- Export Options: Additional output formats (Excel, JSON)
- Fork the repository
- Install development dependencies:

  ```shell
  pip install -r requirements.txt
  pip install pytest black flake8
  ```

- Run tests:

  ```shell
  pytest tests/
  ```

- Format code:

  ```shell
  black src/
  ```

- Check style:

  ```shell
  flake8 src/
  ```
This project is developed for statistical survey data processing applications.
Common Issues:
- Import Errors (`ModuleNotFoundError: No module named 'src'`):
  - Solution: Make sure you're running the application from the project root directory
  - Alternative: Use the startup script: `python run_app.py`
  - Manual fix: Ensure the project structure is intact and all `__init__.py` files exist
- Missing Dependencies (`ImportError` or `ModuleNotFoundError`):
  - Solution: Install all dependencies: `pip install -r requirements.txt`
- File Upload Issues: Check the file size (<100MB) and format (CSV/Excel)
- Memory Errors: For large files, try data cleaning to reduce size first
- Weight Validation Errors: Ensure the weight column contains positive numeric values
- DataFrame Creation Errors: Usually resolved by updating pandas or restarting the application
- Check the error messages in the Streamlit interface
- Review the validation warnings for data quality issues
- Use the sample data to test functionality
- Check the browser console for JavaScript errors
- Use CSV format for better performance with large files
- Clean data early to reduce processing time
- Limit visualizations for very large datasets
- Use appropriate sample sizes for bootstrap methods
Built for Survey Data Processing Excellence
This application accelerates survey data analysis while ensuring methodological rigor and reproducibility.