An AI-enhanced application for automated data preparation, estimation and report writing for statistical survey data.
This application streamlines the survey data processing workflow by providing:
- Automated Data Cleaning: Missing value imputation, outlier detection, rule-based validation
- Weight Application: Design weights and weighted/unweighted summaries with smart detection and validation
- Advanced Weighting Dashboard: Interactive impact analysis showing before/after comparisons
- Report Generation: Standardized PDF/HTML reports with visualizations
- 🤖 AI-Powered Reporting: Comprehensive analysis reports using Google Gemini AI
- User-Friendly Interface: Streamlit-based web interface with guided workflow
```
statsathon/
├── src/
│   ├── data_processing/          # Data loading, cleaning, and validation
│   │   ├── data_loader.py        # CSV/Excel file handling
│   │   ├── data_cleaner.py       # Missing values, outliers, rules
│   │   └── schema_mapper.py      # Column mapping and types
│   ├── analysis/                 # Statistical analysis and weights
│   │   ├── weight_calculator.py  # Weighted statistics
│   │   └── estimator.py          # Population estimation
│   ├── reporting/                # Report generation
│   │   ├── report_generator.py   # PDF/HTML reports
│   │   └── templates/            # Report templates
│   └── utils/                    # Common utilities
│       ├── config.py             # Configuration
│       └── validators.py         # Data validation
├── frontend/                     # Streamlit interface
│   └── streamlit_app.py          # Main web application
├── tests/                        # Unit tests
├── data/                         # Sample datasets
│   ├── sample_survey_data.csv
│   └── create_sample_data.py
├── requirements.txt              # Python dependencies
├── run_app.py                    # Startup script
└── README.md                     # This file
```
- Python 3.8 or higher
- pip package manager

1. Clone or download the project

   ```shell
   cd statsathon
   ```

2. Install dependencies

   ```shell
   pip install -r requirements.txt
   ```

3. Run the application

   ```shell
   python run_app.py
   ```

   Or directly with Streamlit:

   ```shell
   streamlit run frontend/streamlit_app.py
   ```

4. Open your browser
   - Navigate to http://localhost:8501
   - The application will start automatically
- File Upload: Drag-and-drop CSV/Excel files (up to 100MB)
- Automatic Schema Detection: Identifies column types and roles
- Data Preview: Interactive data exploration
- Validation: Comprehensive data quality checks
- Hybrid Approach: Choose between Guided Mode and Manual Workshop
- Guided Mode (🤖): AI-powered automatic suggestions for quick setup
- Intelligent analysis of data quality issues
- One-click application of recommended cleaning rules
- Perfect for rapid prototyping and standard workflows
- Manual Workshop (🔧): Full control for data scientists and power users
- Column-by-column manual configuration
- Chainable operations and custom pipelines
- Advanced parameter tuning for each method
- Real-time preview of effects
- Missing Value Imputation:
- Mean, median, mode imputation
- KNN imputation for complex patterns
- Forward/backward fill methods
- Outlier Detection:
- IQR method (Interquartile Range) with customizable multipliers
- Z-score and Modified Z-score with adjustable thresholds
- Isolation Forest for multivariate outliers
- Percentile-based detection
- Outlier Treatment:
- Flag outliers for review
- Remove outliers from dataset
- Winsorization with custom percentiles
- Capping at specified bounds
- Data Validation:
- Custom business rules
- Range validations
- Consistency checks
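To make the cleaning methods concrete, here is a minimal pandas sketch of median imputation, IQR-based outlier detection, and winsorization. The column name `income` is hypothetical, and the app's actual implementation in `data_cleaner.py` may differ:

```python
import numpy as np
import pandas as pd

def impute_median(series: pd.Series) -> pd.Series:
    """Fill missing values with the column median."""
    return series.fillna(series.median())

def iqr_outlier_mask(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)

def winsorize(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Cap values at the given lower/upper percentiles."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lo, hi)

df = pd.DataFrame({"income": [30.0, 32.0, 31.0, np.nan, 29.0, 500.0]})
df["income"] = impute_median(df["income"])
print(int(iqr_outlier_mask(df["income"]).sum()))  # → 1 (the value 500 is flagged)
```

Raising the IQR multiplier (e.g. 3.0) makes detection more conservative; winsorization keeps the row but pulls the extreme value back to the chosen percentile.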
- Survey Weight Support: Handles complex survey designs
- Weighted Statistics: Mean, total, proportion calculations
- Design Effect: Measures impact of survey design
- Effective Sample Size: Accounts for weighting effects
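The weighted statistics above follow standard formulas; a minimal NumPy sketch (the example values are illustrative, not the app's code):

```python
import numpy as np

def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

def weighted_total(x, w):
    return np.sum(w * x)

def weighted_proportion(mask, w):
    """Share of the weighted population where mask is True."""
    return np.sum(w * mask) / np.sum(w)

x = np.array([4.0, 2.0, 5.0])
w = np.array([2.0, 1.0, 1.0])          # hypothetical design weights
print(weighted_mean(x, w))              # → 3.75  ((8 + 2 + 5) / 4)
print(weighted_total(x, w))             # → 15.0
print(weighted_proportion(x >= 4, w))   # → 0.75  ((2 + 1) / 4)
```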
- Parameter Estimation: Population totals, means, proportions
- Confidence Intervals: Multiple confidence levels (90%, 95%, 99%)
- Variance Estimation:
- Simple random sampling
- Taylor linearization
- Jackknife resampling
- Bootstrap methods
- Stratified Analysis: Support for stratified samples
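As one example of the resampling-based variance methods, here is a percentile-bootstrap confidence interval for a weighted mean. This is a sketch assuming simple with-replacement resampling of respondents; the app's estimator may handle strata and clusters differently:

```python
import numpy as np

def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

def bootstrap_ci(x, w, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for a weighted statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample respondents with replacement
        reps[b] = stat(x[idx], w[idx])
    alpha = (1 - level) / 2
    return np.quantile(reps, [alpha, 1 - alpha])

x = np.array([3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 3.0, 4.0])
w = np.array([1.0, 2.0, 1.5, 1.0, 0.5, 1.0, 2.0, 1.0])
lo, hi = bootstrap_ci(x, w, weighted_mean)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```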
- Automated Reports: PDF and HTML formats
- Interactive Visualizations: Charts and graphs
- Comprehensive Summaries: All analysis results
- Customizable Templates: Flexible report layouts
- Google Gemini Integration: Advanced LLM-powered report generation
- Comprehensive Analysis: Every detail analyzed and explained
- Multiple Report Depths: From executive summaries to academic research level
- Smart Insights: AI identifies patterns, correlations, and actionable insights
- Professional Formatting: Publication-ready reports in multiple formats
- Interactive Refinement: Edit and refine reports with AI assistance
- Multi-Format Export: Markdown, HTML, and text formats
- Customizable Focus: Tailored analysis for different survey types
- Smart Weight Detection: AI-powered automatic weight column identification
- Comprehensive Validation: Real-time weight quality assessment
- Impact Dashboard: Visual before/after comparisons showing weighting effects
- Distribution Analysis: Interactive charts showing how weights reshape data
- Expert Controls: Weight trimming, post-stratification, and calibration tools
- Quality Metrics: Design effect, effective sample size, and efficiency calculations
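The quality metrics listed above have standard definitions: the Kish effective sample size is (Σw)² / Σw², and the design effect due to unequal weighting is n / n_eff. A sketch of those checks plus simple percentile weight trimming (function names are illustrative, not the app's API):

```python
import numpy as np

def validate_weights(w):
    """Basic quality checks before applying survey weights."""
    w = np.asarray(w, dtype=float)
    issues = []
    if np.any(~np.isfinite(w)):
        issues.append("non-finite weights")
    if np.any(w <= 0):
        issues.append("non-positive weights")
    return issues

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

def design_effect(w):
    """Kish design effect from unequal weighting: n / n_eff."""
    return len(w) / effective_sample_size(w)

def trim_weights(w, upper_pct=99):
    """Cap extreme weights at a percentile to reduce variance."""
    cap = np.percentile(w, upper_pct)
    return np.minimum(w, cap)

w = np.array([1.0, 1.0, 1.0, 10.0])
print(validate_weights(w))                    # → []
print(round(effective_sample_size(w), 2))     # → 1.64
print(round(design_effect(w), 2))             # → 2.44
```

One extreme weight drops the effective sample size from 4 to about 1.6, which is exactly the kind of effect the impact dashboard is meant to surface.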
- Navigate to "Data Upload" page
- Upload your CSV or Excel file
- Review file information and data preview
- Check validation results
- Go to "Schema Configuration"
- Click "Analyze Schema" to auto-detect column types
- Review and modify column classifications:
- Numeric: Continuous variables
- Categorical: Discrete categories
- Weight: Survey weights
- ID: Identifier columns
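Auto-detection of column roles can be approximated with simple heuristics; a sketch (the rules and thresholds are illustrative, not the logic in `schema_mapper.py`):

```python
import pandas as pd

def detect_column_role(series: pd.Series) -> str:
    """Heuristic column classification (a simplified sketch)."""
    name = (series.name or "").lower()
    if name in {"id", "respondent_id"} or (series.is_unique and series.dtype == object):
        return "id"
    if "weight" in name:
        return "weight"
    if pd.api.types.is_numeric_dtype(series):
        # Low-cardinality numerics often encode categories
        if series.nunique() <= 10 and series.nunique() < len(series):
            return "categorical"
        return "numeric"
    return "categorical"

df = pd.DataFrame({
    "respondent_id": ["a1", "a2", "a3"],
    "age": [25, 40, 61],
    "design_weight": [1.2, 0.8, 1.0],
    "region": ["N", "S", "N"],
})
print({c: detect_column_role(df[c]) for c in df.columns})
```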
- Navigate to "Data Cleaning"
- Choose Your Approach:
Option A: Guided Mode (🤖) - Recommended for beginners
- Click "Suggest Cleaning Rules" for automatic recommendations
- Review AI-generated suggestions with explanations
- Select which rules to apply with checkboxes
- Apply selected cleaning operations with one click
Option B: Manual Workshop (🔧) - For data scientists & power users
- Step 1: Select a column to work with
- Step 2: Choose operation type:
- Handle Missing Values (mean, median, KNN, etc.)
- Detect & Handle Outliers (IQR, Z-score, etc.)
- Transform Data (coming soon)
- Validate Data (range checks, custom conditions)
- Custom Rules (advanced)
- Step 3: Configure method-specific parameters:
- IQR multiplier (e.g., 1.5, 2.0, 3.0)
- Z-score threshold (e.g., 2.0, 3.0, 4.0)
- KNN neighbors count
- Outlier treatment action (flag, remove, winsorize, cap)
- Step 4: Add to pipeline and repeat for other columns
- Step 5: Execute entire pipeline or modify as needed
- Compare Methods: In Manual Workshop, you can:
- Try IQR vs Z-score on the same column
- Compare different imputation methods
- Build complex multi-step pipelines
- Save and reuse cleaning workflows
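The chainable-pipeline idea behind the Manual Workshop can be sketched in a few lines; this is a minimal illustration (class and column names are hypothetical), not the app's actual pipeline code:

```python
import numpy as np
import pandas as pd

class CleaningPipeline:
    """Minimal chainable pipeline: queue steps, then execute them in order."""
    def __init__(self):
        self.steps = []

    def add(self, column, func, label):
        self.steps.append((column, func, label))
        return self   # returning self enables chaining

    def run(self, df):
        df = df.copy()
        for column, func, label in self.steps:
            df[column] = func(df[column])
        return df

pipe = (CleaningPipeline()
        .add("age", lambda s: s.fillna(s.median()), "impute median")
        .add("age", lambda s: s.clip(upper=s.quantile(0.99)), "cap at p99"))

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 200.0]})
cleaned = pipe.run(df)
print(cleaned["age"].tolist())
```

Because the queued steps are just data, a pipeline like this can be inspected, reordered, saved, and reused, which is what makes the workflow reproducible.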
- Go to "Weight Analysis"
- Select your weight column
- Choose variables to analyze
- Select statistics to calculate (count, mean, std, etc.)
- Run the analysis to see weighted vs unweighted results
- Navigate to "Population Estimation"
- Configure estimation parameters:
- Select weight column
- Choose variables for estimation
- Select estimation types (mean, total, proportion)
- Choose variance estimation method
- Run estimation to get confidence intervals
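The confidence intervals produced by this step follow the usual normal approximation, estimate ± z·SE. A sketch using a simple Taylor/SRS-style variance for a weighted mean (real complex-design variance estimation differs; the formula here ignores strata and clusters):

```python
import numpy as np
from scipy import stats

def weighted_mean_ci(x, w, level=0.95):
    """Normal-approximation CI with a simple SRS-style variance:
    Var(mean_hat) ~= sum(w_i^2 * (x_i - mean_hat)^2) / (sum w)^2
    """
    x, w = np.asarray(x, float), np.asarray(w, float)
    mean_hat = np.sum(w * x) / np.sum(w)
    var_hat = np.sum(w ** 2 * (x - mean_hat) ** 2) / np.sum(w) ** 2
    z = stats.norm.ppf(0.5 + level / 2)   # 1.96 for a 95% interval
    half = z * np.sqrt(var_hat)
    return mean_hat - half, mean_hat + half

x = np.array([3.0, 4.0, 5.0, 4.0, 6.0])
w = np.array([1.0, 2.0, 1.0, 1.0, 1.0])
lo, hi = weighted_mean_ci(x, w)
print(f"[{lo:.2f}, {hi:.2f}]")
```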
- Go to "Report Generation"
- Configure report settings:
- Enter report title
- Choose format (PDF or HTML)
- Select sections to include
- Generate and download your report
- Navigate to "🤖 AI Report (Gemini)"
- Configure Gemini AI settings:
- Enter your Google Gemini API key
- Select model (gemini-1.5-pro recommended)
- Configure report parameters:
- Set report title and analysis focus
- Choose depth level (Executive to Academic Research)
- Select target audience and language style
- Choose sections to include
- Click "Generate Comprehensive AI Report"
- Review, edit, and download in multiple formats
AI Report Features:
- Automatic Analysis: AI analyzes every aspect of your data
- Professional Insights: Identifies patterns and correlations
- Actionable Recommendations: Specific next steps and improvements
- Multiple Formats: Download as Markdown, HTML, or text
- Interactive Refinement: Ask AI to focus on specific areas
- Publication Ready: Professional formatting for reports and presentations
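A rough sketch of how such a report request could be assembled. The prompt fields are illustrative, and the `google-generativeai` usage (shown commented out, since it needs an API key) is an assumption based on this README, not the app's confirmed implementation:

```python
def build_report_prompt(summary: dict, depth: str = "Executive",
                        focus: str = "overall quality") -> str:
    """Assemble an analysis prompt from a dataset summary (illustrative)."""
    lines = [
        f"Write a survey analysis report at {depth} depth, focused on {focus}.",
        "Dataset summary:",
    ]
    for key, value in summary.items():
        lines.append(f"- {key}: {value}")
    lines.append("Identify patterns, correlations, and actionable recommendations.")
    return "\n".join(lines)

prompt = build_report_prompt(
    {"rows": 1000, "weighted mean satisfaction": 4.05, "missing rate": "2.3%"}
)

# Actual call (requires a Gemini API key; uncomment to run):
# import google.generativeai as genai
# genai.configure(api_key="YOUR_API_KEY")
# model = genai.GenerativeModel("gemini-1.5-pro")
# report = model.generate_content(prompt).text
print(prompt.splitlines()[0])
```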
The application includes sample datasets for testing:
- sample_survey_data.csv: Realistic survey data with demographics, satisfaction scores, and survey weights
- problematic_test_data.csv: Dataset with various data quality issues for testing cleaning features
To generate new sample data:

```shell
cd data
python create_sample_data.py
```

The application uses modular configuration in `src/utils/config.py`. Key settings include:
- File Limits: Maximum file size (100MB), supported formats
- Cleaning Methods: Available imputation and outlier detection methods
- Statistical Parameters: Confidence levels, variance methods
- Report Options: Output formats, template settings
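A hypothetical sketch of what such a settings module might look like; the keys and values below are illustrative, not the actual contents of `src/utils/config.py`:

```python
# Illustrative configuration shape (names are assumptions, not the app's real keys)
CONFIG = {
    "file_limits": {"max_size_mb": 100, "formats": [".csv", ".xlsx", ".xls"]},
    "cleaning": {
        "imputation_methods": ["mean", "median", "mode", "knn", "ffill", "bfill"],
        "outlier_methods": ["iqr", "zscore", "modified_zscore", "isolation_forest"],
    },
    "statistics": {
        "confidence_levels": [0.90, 0.95, 0.99],
        "variance_methods": ["srs", "taylor", "jackknife", "bootstrap"],
    },
    "reports": {"formats": ["pdf", "html"]},
}
```

Centralizing limits and method lists this way lets the UI, validators, and estimators all read from one place.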
- Streamlit: Web interface framework
- Pandas: Data manipulation and analysis
- NumPy: Numerical computations
- SciPy: Statistical functions
- Scikit-learn: Machine learning algorithms
- Plotly: Interactive visualizations
- ReportLab: PDF generation
- Jinja2: HTML templating
- Modular Design: Separate modules for data processing, analysis, and reporting
- Configuration Management: Centralized settings and parameters
- Error Handling: Comprehensive validation and error reporting
- Memory Efficient: Optimized for large datasets
Run tests using pytest:

```shell
pytest tests/
```

- Unit tests for all core modules
- Integration tests for workflow
- Sample data validation
- Error handling verification
- File Size: Handles files up to 100MB
- Rows: Tested with datasets up to 1M rows
- Memory: Efficient memory usage with chunked processing
- Speed: Optimized algorithms for large datasets
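Chunked processing for large files follows the standard pandas pattern: accumulate running sums over `read_csv` chunks instead of loading everything at once. A minimal sketch (column names are hypothetical):

```python
import io
import pandas as pd

def chunked_weighted_mean(source, value_col, weight_col, chunksize=100_000):
    """Accumulate sum(w*x) and sum(w) chunk by chunk to bound memory use."""
    wx_total = w_total = 0.0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        wx_total += (chunk[value_col] * chunk[weight_col]).sum()
        w_total += chunk[weight_col].sum()
    return wx_total / w_total   # weighted mean over the full file

# Small in-memory demo; in practice `source` would be a file path
csv_data = "satisfaction,weight\n4,1\n2,1\n5,2\n"
print(chunked_weighted_mean(io.StringIO(csv_data), "satisfaction", "weight",
                            chunksize=2))  # → 4.0
```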
- Very large files (>100MB) may require memory optimization
- Complex survey designs may need custom variance estimation
- PDF reports limited to essential tables for performance
This is an MVP version. Future enhancements may include:
- Advanced Survey Design: Complex stratification and clustering
- Dashboard Interface: Real-time analytics dashboard
- API Integration: REST API for programmatic access
- Cloud Deployment: Docker containers and cloud hosting
- Audit Trail: Complete workflow logging and versioning
- Export Options: Additional output formats (Excel, JSON)
- Fork the repository
- Install development dependencies:

  ```shell
  pip install -r requirements.txt
  pip install pytest black flake8
  ```

- Run tests:

  ```shell
  pytest tests/
  ```

- Format code:

  ```shell
  black src/
  ```

- Check style:

  ```shell
  flake8 src/
  ```
This project is developed for statistical survey data processing applications.
Common Issues:
- Import Errors (`ModuleNotFoundError: No module named 'src'`):
  - Solution: Make sure you're running the application from the project root directory
  - Alternative: Use the startup script: `python run_app.py`
  - Manual fix: Ensure the project structure is intact and all `__init__.py` files exist
- Missing Dependencies (`ImportError` or `ModuleNotFoundError`):
  - Solution: Install all dependencies: `pip install -r requirements.txt`
- File Upload Issues: Check the file size (<100MB) and format (CSV/Excel)
- Memory Errors: For large files, try data cleaning to reduce size first
- Weight Validation Errors: Ensure the weight column contains positive numeric values
- DataFrame Creation Errors: Usually resolved by updating pandas or restarting the application
- Check the error messages in the Streamlit interface
- Review the validation warnings for data quality issues
- Use the sample data to test functionality
- Check the browser console for JavaScript errors
- Use CSV format for better performance with large files
- Clean data early to reduce processing time
- Limit visualizations for very large datasets
- Use appropriate sample sizes for bootstrap methods
Built for Survey Data Processing Excellence
This application accelerates survey data analysis while ensuring methodological rigor and reproducibility.