Income model, update documentation, examples, and add ARM build to pipeline#86

Open
yamilbknsu wants to merge 31 commits into dev from yep/income_model

@yamilbknsu
Collaborator

This pull request introduces several important updates to the DEMOS demographic microsimulator, focusing on improved documentation, expanded Docker support, configuration and model enhancements, and codebase improvements. The changes make the project more accessible to new users, add support for ARM64 Docker builds, update model configurations and coefficients for greater clarity and accuracy, and introduce new model modules and utility functions.

Documentation and Usability Improvements:

  • The README.md has been significantly expanded to provide clearer setup instructions, including prerequisites, example data usage, and troubleshooting for Docker memory limits. Documentation links have been updated and a new contact section has been added. [1] [2] [3] [4]

Docker and Workflow Enhancements:

  • The Docker workflow now supports multi-platform builds (both linux/amd64 and linux/arm64), allowing the project to be run on a wider range of systems, including Apple Silicon Macs.

Configuration and Model Updates:

  • The TOML configuration files (demos_config.toml and demos_config_ref.toml) have been reordered and expanded to include new modules such as income, income_adjustment, and normalize_table_dtypes. Redundant or unused table definitions have been removed for clarity. [1] [2] [3] [4]
  • New and revised model coefficient files have been added or updated for modules such as birth, cohabitation, labor force participation, divorce, education, kids moving, marriage, mortality, and income. These updates standardize variable names and model expressions for consistency and clarity. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

Core Codebase Improvements:

  • The model initialization and module imports have been updated to include new modules such as normalize_table_dtypes and income, and to ensure all constants are available. [1] [2]
  • A new utility function hh_head_age has been added to demos/models/aging.py to efficiently retrieve the age of the household head, supporting new model requirements.
  • The birth model has been updated to use standardized household race column names, improving code clarity and reducing potential for errors.
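As a rough illustration of the household-head lookup described above, here is a minimal pandas sketch. It is not the code in demos/models/aging.py: the helper name head_age, the toy tables, and the assumption that relate == 0 marks the household head are all hypothetical.

```python
import pandas as pd

# Toy persons table. Column names and the relate == 0 "head" code are assumptions.
persons = pd.DataFrame(
    {
        "household_id": [1, 1, 2, 2, 2],
        "relate": [0, 1, 0, 2, 2],
        "age": [45, 43, 38, 10, 7],
    }
)
# Household 3 has no persons, to show how missing heads are handled.
households = pd.DataFrame(index=pd.Index([1, 2, 3], name="household_id"))


def head_age(persons: pd.DataFrame, household_index: pd.Index) -> pd.Series:
    """Return the head's age per household, aligned to the households index."""
    heads = persons.loc[persons["relate"] == 0, ["household_id", "age"]]
    # Assumes at most one head per household; keeps the first if there are several.
    ages = heads.drop_duplicates("household_id").set_index("household_id")["age"]
    # Reindexing yields NaN for households without a head.
    return ages.reindex(household_index).rename("hh_head_age")


hh_head_age = head_age(persons, households.index)
```

Filtering to heads first and reindexing against the households index avoids a per-household Python loop and keeps the result aligned even when a household has no head row.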

@yamilbknsu yamilbknsu requested a review from Copilot April 24, 2026 17:20

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR expands DEMOS with a new income model + regression template, reorganizes/standardizes computed variables, and significantly improves documentation and Docker support (including ARM64 builds).

Changes:

  • Add an OLS regression template and an income model step with supporting household/person variables.
  • Refactor and relocate many Orca computed columns into model modules; refresh example calibrated coefficient YAMLs accordingly.
  • Expand Sphinx docs (calibration, variables reference, data sources) and update Docker publishing to build for linux/amd64 and linux/arm64.
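For readers unfamiliar with the technique behind an OLS regression template, the following is a minimal NumPy sketch of fitting and applying ordinary least squares. It does not reproduce the RegressionStep API; the feature columns, coefficients, and synthetic data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical household features: intercept, head age, number of workers.
n = 200
X = np.column_stack(
    [np.ones(n), rng.uniform(20, 70, n), rng.integers(0, 4, n)]
)

# Synthetic response generated from known coefficients plus noise.
true_beta = np.array([9.0, 0.01, 0.25])
y = X @ true_beta + rng.normal(0.0, 0.1, n)

# Ordinary least squares: choose beta to minimize ||X @ beta - y||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction is the linear combination of features and fitted coefficients.
y_hat = X @ beta_hat
```

In a simulation step, only the prediction line is needed at run time, since the coefficients would come from a previously calibrated file.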

Reviewed changes

Copilot reviewed 41 out of 43 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| scripts/clean_columns.py | Adds a utility script to subset H5 columns and synthesize job fields. |
| pyproject.toml | Bumps version, updates metadata/URLs, removes project script entry. |
| docs/source/pages/variables.md | Adds a comprehensive variables reference page. |
| docs/source/pages/intro.md | Expands Docker Compose onboarding instructions. |
| docs/source/pages/datasources.rst | Adds detailed docs on CSV vs HDF5 table sources and config examples. |
| docs/source/pages/configuration.rst | Links new calibration + variables pages into configuration docs. |
| docs/source/pages/calibration.md | Adds detailed calibration documentation and examples. |
| docs/source/pages/advanced_configuration.md | Adds explanation of lazy Orca columns, caching, and GEOID assignment. |
| docs/source/conf.py | Updates org metadata and enables MyST dollarmath. |
| docker-compose.yml | Updates default GHCR namespace for the image. |
| demos/variables.py | Large refactor/cleanup of computed variables; introduces standardized naming. |
| demos/templates/estimated_models/regression_model.py | Adds RegressionStep (OLS) estimated model template. |
| demos/templates/estimated_models/__init__.py | Exposes RegressionStep from estimated_models package. |
| demos/templates/calibration/procedures.py | Broadens calibration typing to TemplateStep and improves logging messages. |
| demos/models/marriage.py | Changes matching sort key from earning to computed per-person household income. |
| demos/models/kids_moving.py | Moves kids-move computed columns into kids_moving module. |
| demos/models/income.py | Introduces income step and associated household columns for income regression. |
| demos/models/household_reorg.py | Renames head-race columns and adds household aggregate computed columns. |
| demos/models/fatality.py | Moves mortality age-bin computed columns into fatality module. |
| demos/models/employment.py | Stops updating earnings in employment step; moves employment age bins into module. |
| demos/models/data_fix.py | Adds normalize_table_dtypes step to reduce PyTables object-dtype warnings. |
| demos/models/constants.py | Adds relational adjustment mapping table and state quartile labels. |
| demos/models/birth.py | Updates head-race column names and adds birth-related household computed columns. |
| demos/models/aging.py | Adds hh_head_age computed column (households). |
| demos/models/__init__.py | Imports income + constants modules for initialization side effects/registration. |
| demos/config.py | Adds normalize_table_dtypes to default module ordering. |
| data/small_example/calibrated_models_coefficients/mortality_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/marriage.yaml | Updates spec names to standardized variable names. |
| data/small_example/calibrated_models_coefficients/kids_move_model.yaml | Updates model expression to standardized variable names + new cross-products. |
| data/small_example/calibrated_models_coefficients/income_model_w_nworkers.yaml | Adds new income model coefficients using RegressionStep. |
| data/small_example/calibrated_models_coefficients/income_model.yaml | Adds baseline income model coefficients using RegressionStep. |
| data/small_example/calibrated_models_coefficients/edu_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/divorce_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/demos_out_labor_force.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/demos_in_labor_force.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/cohabitation.yaml | Updates spec names to standardized variable names. |
| data/small_example/calibrated_models_coefficients/birth_model.yaml | Updates model expression to standardized variable names. |
| configuration/demos_config_ref.toml | Reorders modules, adds income + dtype normalization, removes CSV mapping table entry. |
| configuration/demos_config.toml | Reorders modules, adds income + dtype normalization, removes CSV mapping table entry. |
| README.md | Updates docs link, expands onboarding instructions, adds contact info. |
| .github/workflows/docker.yml | Enables multi-platform Docker builds (amd64 + arm64). |

Comments suppressed due to low confidence (10)

demos/variables.py:1

  • The edu_hs_ged Orca column is defined twice in the same module. The latter definition will silently override the former, which is error-prone and can hide future changes. Remove the duplicate block (or consolidate into a single definition) to ensure there is exactly one authoritative implementation.

demos/variables.py:1

  • The edu_hs_ged Orca column is defined twice in the same module. The latter definition will silently override the former, which is error-prone and can hide future changes. Remove the duplicate block (or consolidate into a single definition) to ensure there is exactly one authoritative implementation.

demos/variables.py:1

  • age_60plus returns a DataFrame because p is a DataFrame; Orca columns are expected to return a 1D Series aligned to the table index. Use the age Series (e.g., persons.to_frame(...)["age"]) so this returns a Series.

demos/templates/estimated_models/regression_model.py:1

  • The constructor uses a mutable default argument (tags=[]). This can cause tags to leak across instances if the list is mutated. Use tags=None and normalize to an empty list inside __init__.

docker-compose.yml:1

  • The service hard-pins the platform to linux/amd64. With ARM64 images now being built, this forces amd64 emulation on ARM hosts (e.g., Apple Silicon), which can significantly degrade performance and sometimes breaks native execution. Consider removing platform: (let Docker pick the native arch) or making it configurable via an env var.

scripts/clean_columns.py:1

  • This script introduces non-deterministic outputs by using np.random.choice without a seed, which makes it hard to reproduce results across runs. Consider adding a --seed argument (or documenting that this is purely for synthetic/testing use) and seeding a local RNG. Also fix the typo in the comment (hardn0coded → hardcoded) to avoid confusion.

scripts/clean_columns.py:1

  • This script introduces non-deterministic outputs by using np.random.choice without a seed, which makes it hard to reproduce results across runs. Consider adding a --seed argument (or documenting that this is purely for synthetic/testing use) and seeding a local RNG. Also fix the typo in the comment (hardn0coded → hardcoded) to avoid confusion.

pyproject.toml:1

  • Correct organization name spelling: 'Berkley' should be 'Berkeley' in the author metadata.

pyproject.toml:1

  • This change removes the [project.scripts] entry that previously exposed a demos CLI command. If users rely on pip install . providing a demos executable, this is a breaking change. Either restore the script entry (if still supported) or update documentation to describe the new invocation method.

docs/source/pages/calibration.md:1

  • The term 'marrital' is misspelled in the docs (should be 'marital'). If the code truly hard-codes the misspelled table name, consider supporting the correctly-spelled alias as well (or correcting the code + docs together) to avoid locking in a typo as part of the external interface.
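
The mutable-default concern raised for regression_model.py can be illustrated with a small standalone sketch. The two class names below are hypothetical and do not mirror the actual RegressionStep signature; they only contrast tags=[] with tags=None.

```python
class StepWithSharedTags:
    """Problematic: the default list is created once and shared by every instance."""

    def __init__(self, name, tags=[]):
        self.name = name
        self.tags = tags


class StepWithOwnTags:
    """Safer: None is the sentinel and each instance gets its own list."""

    def __init__(self, name, tags=None):
        self.name = name
        # Copy the caller's list so later mutations do not leak either way.
        self.tags = list(tags) if tags is not None else []


# With the shared default, appending through one instance affects the other.
a = StepWithSharedTags("a")
b = StepWithSharedTags("b")
a.tags.append("income")
leaked = b.tags

# With the None sentinel, the second instance is unaffected.
c = StepWithOwnTags("c")
d = StepWithOwnTags("d")
c.tags.append("income")
isolated = d.tags
```

Here leaked ends up containing the tag appended through a, while isolated stays empty, which is the behavior the review comment asks for.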


Comment thread demos/models/household_reorg.py
Comment on lines +339 to +344
# Max education year of head or spouse (relate < 2)
@orca.column("households")
def hh_edu_top(persons):
    df = persons.to_frame(columns=["household_id", "edu", "relate"])
    df = df[df["relate"] < 2][["household_id", "edu"]]
    return df.groupby("household_id").agg({"edu": "max"})

Copilot AI Apr 24, 2026

Several @orca.column("households") functions appear to return DataFrames (e.g., groupby(...).agg({...})) or return a column name ("age") that does not match the semantic meaning (hh_n_children). Orca columns should return a Series for the target table’s index. For hh_n_children, compute a per-household count and return it as a Series named hh_n_children. For hh_edu_top/hh_age_avg (and similar), return the aggregated Series (e.g., .max() / .mean()) rather than a single-column DataFrame.
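
The Series-versus-DataFrame distinction in this comment can be sketched with plain pandas, outside Orca. The toy persons table is invented, and treating age < 18 as the definition of a child is an assumption rather than something the PR states.

```python
import pandas as pd

# Toy persons table for illustration only.
persons = pd.DataFrame(
    {
        "household_id": [1, 1, 1, 2, 2],
        "relate": [0, 1, 2, 0, 2],
        "edu": [16, 18, 5, 12, 3],
        "age": [40, 41, 9, 35, 4],
    }
)

heads_and_spouses = persons[persons["relate"] < 2]

# groupby(...).agg({...}) with a dict returns a single-column DataFrame.
as_frame = heads_and_spouses.groupby("household_id").agg({"edu": "max"})

# Selecting the column before aggregating returns a Series instead.
hh_edu_top = (
    heads_and_spouses.groupby("household_id")["edu"].max().rename("hh_edu_top")
)

# Per-household child count as a named Series (age < 18 is an assumed cutoff).
hh_n_children = (
    (persons["age"] < 18)
    .groupby(persons["household_id"])
    .sum()
    .rename("hh_n_children")
)
```

Both Series are indexed by household_id, so inside an Orca column function they could be returned directly, with a reindex against the households index if some households have no matching persons.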

Comment thread demos/models/household_reorg.py
Comment thread demos/models/income.py
Comment thread demos/models/income.py
Comment thread README.md Outdated
Comment thread demos/models/constants.py Outdated

3 participants