soumyacodes007/statsathon

Survey Data Processing Application

An AI-enhanced application for automated data preparation, estimation, and report writing for statistical survey data.

🎯 Project Overview

This application streamlines the survey data processing workflow by providing:

  • Automated Data Cleaning: Missing value imputation, outlier detection, rule-based validation
  • Weight Application: Design weights and weighted/unweighted summaries with smart detection and validation
  • Advanced Weighting Dashboard: Interactive impact analysis showing before/after comparisons
  • Report Generation: Standardized PDF/HTML reports with visualizations
  • 🤖 AI-Powered Reporting: Comprehensive analysis reports using Google Gemini AI
  • User-Friendly Interface: Streamlit-based web interface with guided workflow

πŸ—οΈ Project Structure

statsathon/
├── src/
│   ├── data_processing/     # Data loading, cleaning, and validation
│   │   ├── data_loader.py      # CSV/Excel file handling
│   │   ├── data_cleaner.py     # Missing values, outliers, rules
│   │   └── schema_mapper.py    # Column mapping and types
│   ├── analysis/            # Statistical analysis and weights
│   │   ├── weight_calculator.py # Weighted statistics
│   │   └── estimator.py        # Population estimation
│   ├── reporting/           # Report generation
│   │   ├── report_generator.py # PDF/HTML reports
│   │   └── templates/          # Report templates
│   └── utils/               # Common utilities
│       ├── config.py           # Configuration
│       └── validators.py       # Data validation
├── frontend/                # Streamlit interface
│   └── streamlit_app.py        # Main web application
├── tests/                   # Unit tests
├── data/                    # Sample datasets
│   ├── sample_survey_data.csv
│   └── create_sample_data.py
├── requirements.txt         # Python dependencies
├── run_app.py               # Startup script
└── README.md                # This file

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Installation

  1. Clone or download the project

    cd statsathon
  2. Install dependencies

    pip install -r requirements.txt
  3. Run the application

    python run_app.py

    Or directly with Streamlit:

    streamlit run frontend/streamlit_app.py
  4. Open your browser

    • Navigate to http://localhost:8501
    • The application will start automatically

📊 Features

1. Data Input & Configuration

  • File Upload: Drag-and-drop CSV/Excel files (up to 100MB)
  • Automatic Schema Detection: Identifies column types and roles
  • Data Preview: Interactive data exploration
  • Validation: Comprehensive data quality checks
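Schema detection along these lines can be approximated with pandas dtype checks. The sketch below is an illustration only; `detect_schema` and its heuristics are hypothetical and not the app's actual `schema_mapper` API:

```python
import pandas as pd

def detect_schema(df: pd.DataFrame, max_categories: int = 20) -> dict:
    """Guess a role for each column: weight, id, numeric, categorical, or text."""
    roles = {}
    for col in df.columns:
        series = df[col]
        name = col.lower()
        if "weight" in name and pd.api.types.is_numeric_dtype(series):
            roles[col] = "weight"        # survey weight column
        elif name.endswith("id"):
            roles[col] = "id"            # identifier column
        elif pd.api.types.is_numeric_dtype(series):
            roles[col] = "numeric"       # continuous variable
        elif series.nunique() <= max_categories:
            roles[col] = "categorical"   # discrete categories
        else:
            roles[col] = "text"
    return roles

df = pd.DataFrame({
    "resp_id": [1, 2, 3, 4],
    "age": [25, 37, 52, 41],
    "region": ["N", "S", "N", "E"],
    "survey_weight": [1.2, 0.8, 1.1, 0.9],
})
print(detect_schema(df))
# → {'resp_id': 'id', 'age': 'numeric', 'region': 'categorical', 'survey_weight': 'weight'}
```

A real detector would also look at value ranges and missingness, but name plus dtype heuristics cover the common cases.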

2. Data Cleaning

  • Hybrid Approach: Choose between Guided Mode and Manual Workshop
  • Guided Mode (🤖): AI-powered automatic suggestions for quick setup
    • Intelligent analysis of data quality issues
    • One-click application of recommended cleaning rules
    • Perfect for rapid prototyping and standard workflows
  • Manual Workshop (🔧): Full control for data scientists and power users
    • Column-by-column manual configuration
    • Chainable operations and custom pipelines
    • Advanced parameter tuning for each method
    • Real-time preview of effects
  • Missing Value Imputation:
    • Mean, median, mode imputation
    • KNN imputation for complex patterns
    • Forward/backward fill methods
  • Outlier Detection:
    • IQR method (Interquartile Range) with customizable multipliers
    • Z-score and Modified Z-score with adjustable thresholds
    • Isolation Forest for multivariate outliers
    • Percentile-based detection
  • Outlier Treatment:
    • Flag outliers for review
    • Remove outliers from dataset
    • Winsorization with custom percentiles
    • Capping at specified bounds
  • Data Validation:
    • Custom business rules
    • Range validations
    • Consistency checks
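The IQR detection and median-imputation steps above can be sketched in a few lines of pandas. This is a minimal illustration of the techniques, not the app's `data_cleaner` implementation:

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k is the customizable multiplier)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - multiplier * iqr) | (s > q3 + multiplier * iqr)

s = pd.Series([10, 12, 11, 13, 12, 95, np.nan])
mask = iqr_outlier_mask(s)                  # only 95 is flagged
cleaned = s.mask(mask)                      # "remove" treatment: blank out outliers
cleaned = cleaned.fillna(cleaned.median())  # median imputation of all gaps
print(cleaned.tolist())
# → [10.0, 12.0, 11.0, 13.0, 12.0, 12.0, 12.0]
```

Swapping the last two lines for `s.clip(lower, upper)` gives capping, and `scipy.stats.mstats.winsorize` gives winsorization with custom percentiles.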

3. Weight Application

  • Survey Weight Support: Handles complex survey designs
  • Weighted Statistics: Mean, total, proportion calculations
  • Design Effect: Measures impact of survey design
  • Effective Sample Size: Accounts for weighting effects
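The quantities above follow standard survey-statistics definitions: the Kish effective sample size is (Σw)² / Σw², and the design effect due to unequal weighting is n divided by that. A minimal NumPy sketch (illustrative, not the app's `weight_calculator` module):

```python
import numpy as np

def weighted_mean(x, w):
    """Weighted mean: sum(w * x) / sum(w)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return np.sum(w * x) / np.sum(w)

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

def design_effect(w):
    """Kish design effect from unequal weighting: n / n_eff (1.0 = no loss)."""
    w = np.asarray(w, float)
    return len(w) / effective_sample_size(w)

x = [4.0, 2.0, 5.0, 3.0]
w = [2.0, 1.0, 1.0, 2.0]
print(weighted_mean(x, w))        # → 3.5
print(effective_sample_size(w))   # → 3.6
```

A design effect above 1 means the weighting inflates variance; the impact dashboard's before/after comparison is essentially this calculation visualized.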

4. Population Estimation

  • Parameter Estimation: Population totals, means, proportions
  • Confidence Intervals: Multiple confidence levels (90%, 95%, 99%)
  • Variance Estimation:
    • Simple random sampling
    • Taylor linearization
    • Jackknife resampling
    • Bootstrap methods
  • Stratified Analysis: Support for stratified samples
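Of the variance methods listed, the bootstrap is the easiest to sketch: resample whole units with replacement, recompute the weighted estimate each time, and take percentiles. The function below is a simplified illustration, not the app's `estimator.py`:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def bootstrap_ci_mean(x, w, n_boot=2000, level=0.95):
    """Percentile-bootstrap confidence interval for a weighted mean."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample units with replacement
        stats[b] = np.sum(w[idx] * x[idx]) / np.sum(w[idx])
    alpha = 1 - level
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

x = np.linspace(0, 10, 50)
w = np.ones(50)
lo, hi = bootstrap_ci_mean(x, w)   # interval straddling the mean of 5.0
```

A production estimator would bootstrap within strata and rescale weights per replicate; this sketch shows only the core resampling loop.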

5. Report Generation

  • Automated Reports: PDF and HTML formats
  • Interactive Visualizations: Charts and graphs
  • Comprehensive Summaries: All analysis results
  • Customizable Templates: Flexible report layouts

6. 🤖 AI-Powered Reporting (NEW!)

  • Google Gemini Integration: Advanced LLM-powered report generation
  • Comprehensive Analysis: Every detail analyzed and explained
  • Multiple Report Depths: From executive summaries to academic research level
  • Smart Insights: AI identifies patterns, correlations, and actionable insights
  • Professional Formatting: Publication-ready reports in multiple formats
  • Interactive Refinement: Edit and refine reports with AI assistance
  • Multi-Format Export: Markdown, HTML, and text formats
  • Customizable Focus: Tailored analysis for different survey types

7. Advanced Weighting Features (NEW!)

  • Smart Weight Detection: AI-powered automatic weight column identification
  • Comprehensive Validation: Real-time weight quality assessment
  • Impact Dashboard: Visual before/after comparisons showing weighting effects
  • Distribution Analysis: Interactive charts showing how weights reshape data
  • Expert Controls: Weight trimming, post-stratification, and calibration tools
  • Quality Metrics: Design effect, effective sample size, and efficiency calculations

📋 Usage Guide

Step 1: Data Upload

  1. Navigate to "Data Upload" page
  2. Upload your CSV or Excel file
  3. Review file information and data preview
  4. Check validation results

Step 2: Schema Configuration

  1. Go to "Schema Configuration"
  2. Click "Analyze Schema" to auto-detect column types
  3. Review and modify column classifications:
    • Numeric: Continuous variables
    • Categorical: Discrete categories
    • Weight: Survey weights
    • ID: Identifier columns

Step 3: Data Cleaning (Hybrid Approach)

  1. Navigate to "Data Cleaning"

  2. Choose Your Approach:

    Option A: Guided Mode (🤖) - Recommended for beginners

    • Click "Suggest Cleaning Rules" for automatic recommendations
    • Review AI-generated suggestions with explanations
    • Select which rules to apply with checkboxes
    • Apply selected cleaning operations with one click

    Option B: Manual Workshop (🔧) - For data scientists & power users

    • Step 1: Select a column to work with
    • Step 2: Choose operation type:
      • Handle Missing Values (mean, median, KNN, etc.)
      • Detect & Handle Outliers (IQR, Z-score, etc.)
      • Transform Data (coming soon)
      • Validate Data (range checks, custom conditions)
      • Custom Rules (advanced)
    • Step 3: Configure method-specific parameters:
      • IQR multiplier (e.g., 1.5, 2.0, 3.0)
      • Z-score threshold (e.g., 2.0, 3.0, 4.0)
      • KNN neighbors count
      • Outlier treatment action (flag, remove, winsorize, cap)
    • Step 4: Add to pipeline and repeat for other columns
    • Step 5: Execute entire pipeline or modify as needed
  3. Compare Methods: In Manual Workshop, you can:

    • Try IQR vs Z-score on the same column
    • Compare different imputation methods
    • Build complex multi-step pipelines
    • Save and reuse cleaning workflows

Step 4: Weight Analysis

  1. Go to "Weight Analysis"
  2. Select your weight column
  3. Choose variables to analyze
  4. Select statistics to calculate (count, mean, std, etc.)
  5. Run the analysis to see weighted vs unweighted results

Step 5: Population Estimation

  1. Navigate to "Population Estimation"
  2. Configure estimation parameters:
    • Select weight column
    • Choose variables for estimation
    • Select estimation types (mean, total, proportion)
    • Choose variance estimation method
  3. Run estimation to get confidence intervals

Step 6: Report Generation

  1. Go to "Report Generation"
  2. Configure report settings:
    • Enter report title
    • Choose format (PDF or HTML)
    • Select sections to include
  3. Generate and download your report

Step 7: 🤖 AI-Powered Comprehensive Reports (NEW!)

  1. Navigate to "🤖 AI Report (Gemini)"
  2. Configure Gemini AI settings:
    • Enter your Google Gemini API key
    • Select model (gemini-1.5-pro recommended)
  3. Configure report parameters:
    • Set report title and analysis focus
    • Choose depth level (Executive to Academic Research)
    • Select target audience and language style
    • Choose sections to include
  4. Click "Generate Comprehensive AI Report"
  5. Review, edit, and download in multiple formats

AI Report Features:

  • Automatic Analysis: AI analyzes every aspect of your data
  • Professional Insights: Identifies patterns and correlations
  • Actionable Recommendations: Specific next steps and improvements
  • Multiple Formats: Download as Markdown, HTML, or text
  • Interactive Refinement: Ask AI to focus on specific areas
  • Publication Ready: Professional formatting for reports and presentations

🧪 Sample Data

The application includes sample datasets for testing:

  • sample_survey_data.csv: Realistic survey data with demographics, satisfaction scores, and survey weights
  • problematic_test_data.csv: Dataset with various data quality issues for testing cleaning features

To generate new sample data:

cd data
python create_sample_data.py

βš™οΈ Configuration

The application uses modular configuration in src/utils/config.py. Key settings include:

  • File Limits: Maximum file size (100MB), supported formats
  • Cleaning Methods: Available imputation and outlier detection methods
  • Statistical Parameters: Confidence levels, variance methods
  • Report Options: Output formats, template settings

🔧 Technical Details

Core Dependencies

  • Streamlit: Web interface framework
  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computations
  • SciPy: Statistical functions
  • Scikit-learn: Machine learning algorithms
  • Plotly: Interactive visualizations
  • ReportLab: PDF generation
  • Jinja2: HTML templating

Architecture

  • Modular Design: Separate modules for data processing, analysis, and reporting
  • Configuration Management: Centralized settings and parameters
  • Error Handling: Comprehensive validation and error reporting
  • Memory Efficient: Optimized for large datasets

🧪 Testing

Run tests using pytest:

pytest tests/

Test Coverage

  • Unit tests for all core modules
  • Integration tests for workflow
  • Sample data validation
  • Error handling verification

📈 Performance

Scalability

  • File Size: Handles files up to 100MB
  • Rows: Tested with datasets up to 1M rows
  • Memory: Efficient memory usage with chunked processing
  • Speed: Optimized algorithms for large datasets
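Chunked processing of the kind described can be done with pandas' `chunksize` reader, which streams a CSV in fixed-size pieces instead of loading it whole. A minimal sketch (the function name is illustrative, not part of the app):

```python
import pandas as pd

def chunked_column_mean(path: str, column: str, chunksize: int = 100_000) -> float:
    """Compute a column mean from a large CSV without loading it all into memory."""
    total, count = 0.0, 0
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total += float(chunk[column].sum())
        count += int(chunk[column].count())   # count() skips NaN
    return total / count
```

Reading only the needed column with `usecols` and accumulating running totals keeps peak memory proportional to `chunksize`, not file size.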

Limitations

  • Very large files (>100MB) may require memory optimization
  • Complex survey designs may need custom variance estimation
  • PDF reports limited to essential tables for performance

🤝 Contributing

This is an MVP version. Future enhancements may include:

Planned Features

  • Advanced Survey Design: Complex stratification and clustering
  • Dashboard Interface: Real-time analytics dashboard
  • API Integration: REST API for programmatic access
  • Cloud Deployment: Docker containers and cloud hosting
  • Audit Trail: Complete workflow logging and versioning
  • Export Options: Additional output formats (Excel, JSON)

Development Setup

  1. Fork the repository
  2. Install development dependencies:
    pip install -r requirements.txt
    pip install pytest black flake8
  3. Run tests: pytest tests/
  4. Format code: black src/
  5. Check style: flake8 src/

πŸ“ License

This project is developed for statistical survey data processing applications.

🆘 Support

Troubleshooting

Common Issues:

  1. Import Errors (ModuleNotFoundError: No module named 'src'):

    • Solution: Make sure you're running the application from the project root directory
    • Alternative: Use the startup script: python run_app.py
    • Manual fix: Ensure the project structure is intact and all __init__.py files exist
  2. Missing Dependencies (ImportError or ModuleNotFoundError):

    • Solution: Install all dependencies
    pip install -r requirements.txt
  3. File Upload Issues: Check file size (<100MB) and format (CSV/Excel)

  4. Memory Errors: For large files, try data cleaning to reduce size first

  5. Weight Validation Errors: Ensure weight column contains positive numeric values

  6. DataFrame Creation Errors: Usually resolved by updating pandas or restarting the application

Getting Help

  • Check the error messages in the Streamlit interface
  • Review the validation warnings for data quality issues
  • Use the sample data to test functionality
  • Check the browser console for JavaScript errors

Performance Tips

  • Use CSV format for better performance with large files
  • Clean data early to reduce processing time
  • Limit visualizations for very large datasets
  • Use appropriate sample sizes for bootstrap methods

Built for Survey Data Processing Excellence 📊

This application accelerates survey data analysis while ensuring methodological rigor and reproducibility.
