Income model, update documentation, examples, and add ARM build to pipeline#86
yamilbknsu wants to merge 31 commits into dev
Update documentation pre-release
added contact information.
Update small example data
Added `arm` build to docker pipeline
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR expands DEMOS with a new income model + regression template, reorganizes/standardizes computed variables, and significantly improves documentation and Docker support (including ARM64 builds).
Changes:
- Add an OLS regression template and an income model step with supporting household/person variables.
- Refactor and relocate many Orca computed columns into model modules; refresh example calibrated coefficient YAMLs accordingly.
- Expand Sphinx docs (calibration, variables reference, data sources) and update Docker publishing to build for `linux/amd64` and `linux/arm64`.
Reviewed changes
Copilot reviewed 41 out of 43 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/clean_columns.py | Adds a utility script to subset H5 columns and synthesize job fields. |
| pyproject.toml | Bumps version, updates metadata/URLs, removes project script entry. |
| docs/source/pages/variables.md | Adds a comprehensive variables reference page. |
| docs/source/pages/intro.md | Expands Docker Compose onboarding instructions. |
| docs/source/pages/datasources.rst | Adds detailed docs on CSV vs HDF5 table sources and config examples. |
| docs/source/pages/configuration.rst | Links new calibration + variables pages into configuration docs. |
| docs/source/pages/calibration.md | Adds detailed calibration documentation and examples. |
| docs/source/pages/advanced_configuration.md | Adds explanation of lazy Orca columns, caching, and GEOID assignment. |
| docs/source/conf.py | Updates org metadata and enables MyST dollarmath. |
| docker-compose.yml | Updates default GHCR namespace for the image. |
| demos/variables.py | Large refactor/cleanup of computed variables; introduces standardized naming. |
| demos/templates/estimated_models/regression_model.py | Adds RegressionStep (OLS) estimated model template. |
| `demos/templates/estimated_models/__init__.py` | Exposes RegressionStep from estimated_models package. |
| demos/templates/calibration/procedures.py | Broadens calibration typing to TemplateStep and improves logging messages. |
| demos/models/marriage.py | Changes matching sort key from earning to computed per-person household income. |
| demos/models/kids_moving.py | Moves kids-move computed columns into kids_moving module. |
| demos/models/income.py | Introduces income step and associated household columns for income regression. |
| demos/models/household_reorg.py | Renames head-race columns and adds household aggregate computed columns. |
| demos/models/fatality.py | Moves mortality age-bin computed columns into fatality module. |
| demos/models/employment.py | Stops updating earnings in employment step; moves employment age bins into module. |
| demos/models/data_fix.py | Adds normalize_table_dtypes step to reduce PyTables object-dtype warnings. |
| demos/models/constants.py | Adds relational adjustment mapping table and state quartile labels. |
| demos/models/birth.py | Updates head-race column names and adds birth-related household computed columns. |
| demos/models/aging.py | Adds hh_head_age computed column (households). |
| `demos/models/__init__.py` | Imports income + constants modules for initialization side effects/registration. |
| demos/config.py | Adds normalize_table_dtypes to default module ordering. |
| data/small_example/calibrated_models_coefficients/mortality_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/marriage.yaml | Updates spec names to standardized variable names. |
| data/small_example/calibrated_models_coefficients/kids_move_model.yaml | Updates model expression to standardized variable names + new cross-products. |
| data/small_example/calibrated_models_coefficients/income_model_w_nworkers.yaml | Adds new income model coefficients using RegressionStep. |
| data/small_example/calibrated_models_coefficients/income_model.yaml | Adds baseline income model coefficients using RegressionStep. |
| data/small_example/calibrated_models_coefficients/edu_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/divorce_model.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/demos_out_labor_force.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/demos_in_labor_force.yaml | Updates model expression to standardized variable names. |
| data/small_example/calibrated_models_coefficients/cohabitation.yaml | Updates spec names to standardized variable names. |
| data/small_example/calibrated_models_coefficients/birth_model.yaml | Updates model expression to standardized variable names. |
| configuration/demos_config_ref.toml | Reorders modules, adds income + dtype normalization, removes CSV mapping table entry. |
| configuration/demos_config.toml | Reorders modules, adds income + dtype normalization, removes CSV mapping table entry. |
| README.md | Updates docs link, expands onboarding instructions, adds contact info. |
| .github/workflows/docker.yml | Enables multi-platform Docker builds (amd64 + arm64). |
Comments suppressed due to low confidence (10)
demos/variables.py:1
- The `edu_hs_ged` Orca column is defined twice in the same module. The latter definition will silently override the former, which is error-prone and can hide future changes. Remove the duplicate block (or consolidate into a single definition) to ensure there is exactly one authoritative implementation.

demos/variables.py:1
- The `edu_hs_ged` Orca column is defined twice in the same module. The latter definition will silently override the former, which is error-prone and can hide future changes. Remove the duplicate block (or consolidate into a single definition) to ensure there is exactly one authoritative implementation.

demos/variables.py:1
- `age_60plus` returns a DataFrame because `p` is a DataFrame; Orca columns are expected to return a 1D Series aligned to the table index. Use the `age` Series (e.g., `persons.to_frame(...)["age"]`) so this returns a Series.

demos/templates/estimated_models/regression_model.py:1
- The constructor uses a mutable default argument (`tags=[]`). This can cause tags to leak across instances if the list is mutated. Use `tags=None` and normalize to an empty list inside `__init__`.

docker-compose.yml:1
- The service hard-pins the platform to `linux/amd64`. With ARM64 images now being built, this forces amd64 emulation on ARM hosts (e.g., Apple Silicon), which can significantly degrade performance and sometimes breaks native execution. Consider removing `platform:` (let Docker pick the native arch) or making it configurable via an env var.

scripts/clean_columns.py:1
- This script introduces non-deterministic outputs by using `np.random.choice` without a seed, which makes it hard to reproduce results across runs. Consider adding a `--seed` argument (or documenting that this is purely for synthetic/testing use) and seeding a local RNG. Also fix the typo in the comment (`hardn0coded` → `hardcoded`) to avoid confusion.

scripts/clean_columns.py:1
- This script introduces non-deterministic outputs by using `np.random.choice` without a seed, which makes it hard to reproduce results across runs. Consider adding a `--seed` argument (or documenting that this is purely for synthetic/testing use) and seeding a local RNG. Also fix the typo in the comment (`hardn0coded` → `hardcoded`) to avoid confusion.

pyproject.toml:1
- Correct organization name spelling: 'Berkley' should be 'Berkeley' in the author metadata.

pyproject.toml:1
- This change removes the `[project.scripts]` entry that previously exposed a `demos` CLI command. If users rely on `pip install .` providing a `demos` executable, this is a breaking change. Either restore the script entry (if still supported) or update documentation to describe the new invocation method.

docs/source/pages/calibration.md:1
- The term 'marrital' is misspelled in the docs (should be 'marital'). If the code truly hard-codes the misspelled table name, consider supporting the correctly-spelled alias as well (or correcting the code + docs together) to avoid locking in a typo as part of the external interface.
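The mutable-default comment on `regression_model.py` can be illustrated with a minimal sketch. The two classes below are hypothetical stand-ins, not the actual `RegressionStep` constructor; they only show why `tags=[]` leaks state between instances and how `tags=None` avoids it.

```python
class StepWithSharedTags:
    """Hypothetical class reproducing the pitfall: the default list is
    created once, at function definition time, and shared by every call."""

    def __init__(self, name, tags=[]):
        self.name = name
        self.tags = tags


class StepWithOwnTags:
    """Hypothetical class using the suggested fix: default to None and
    build a fresh list per instance."""

    def __init__(self, name, tags=None):
        self.name = name
        # Copy the caller's list (if any) so it is not aliased either.
        self.tags = list(tags) if tags is not None else []


# With the mutable default, appending through one instance affects the other.
a = StepWithSharedTags("a")
b = StepWithSharedTags("b")
a.tags.append("income")

# With the None default, each instance keeps its own list.
c = StepWithOwnTags("c")
d = StepWithOwnTags("d")
c.tags.append("income")
```

After this runs, `b.tags` contains `"income"` even though nothing was appended to `b`, while `d.tags` stays empty.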
    # Max education year of head or spouse (relate < 2)
    @orca.column("households")
    def hh_edu_top(persons):
        df = persons.to_frame(columns=["household_id", "edu", "relate"])
        df = df[df["relate"] < 2][["household_id", "edu"]]
        return df.groupby("household_id").agg({"edu": "max"})
Several `@orca.column("households")` functions appear to return DataFrames (e.g., `groupby(...).agg({...})`) or return a column name (`"age"`) that does not match the semantic meaning (`hh_n_children`). Orca columns should return a Series for the target table's index. For `hh_n_children`, compute a per-household count and return it as a Series named `hh_n_children`. For `hh_edu_top`/`hh_age_avg` (and similar), return the aggregated Series (e.g., `.max()` / `.mean()`) rather than a single-column DataFrame.
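The difference between the two return types can be seen with plain pandas. The sketch below is illustrative only: it omits the Orca decorator, uses a made-up `persons` frame, and the function names and the optional `household_index` parameter are assumptions rather than code from this PR.

```python
import pandas as pd

# Toy persons table (made-up data): relate 0 = head, 1 = spouse, 2+ = others.
persons = pd.DataFrame(
    {
        "household_id": [1, 1, 2, 2, 2],
        "edu": [12, 16, 10, 14, 18],
        "relate": [0, 1, 0, 1, 2],
    }
)


def hh_edu_top_frame(persons_df):
    """Mirrors the reviewed pattern: agg with a dict yields a DataFrame."""
    df = persons_df[persons_df["relate"] < 2][["household_id", "edu"]]
    return df.groupby("household_id").agg({"edu": "max"})


def hh_edu_top_series(persons_df, household_index=None):
    """Suggested pattern: select the column first so the result is a Series."""
    df = persons_df[persons_df["relate"] < 2]
    result = df.groupby("household_id")["edu"].max().rename("hh_edu_top")
    if household_index is not None:
        # Align to the households index; households with no head or
        # spouse rows get NaN instead of being dropped.
        result = result.reindex(household_index)
    return result


as_frame = hh_edu_top_frame(persons)
as_series = hh_edu_top_series(persons)
aligned = hh_edu_top_series(persons, household_index=pd.Index([1, 2, 3]))
```

Selecting `["edu"]` before aggregating is what turns the result into a Series; the optional reindex is one way to keep the result aligned with a households table that includes households with no qualifying persons.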
Co-authored-by: Copilot <[email protected]>
This pull request introduces several important updates to the DEMOS demographic microsimulator, focusing on improved documentation, expanded Docker support, configuration and model enhancements, and codebase improvements. The changes make the project more accessible to new users, add support for ARM64 Docker builds, update model configurations and coefficients for greater clarity and accuracy, and introduce new model modules and utility functions.
Documentation and Usability Improvements:
- `README.md` has been significantly expanded to provide clearer setup instructions, including prerequisites, example data usage, and troubleshooting for Docker memory limits. Documentation links have been updated and a new contact section has been added.

Docker and Workflow Enhancements:
- The Docker workflow now builds multi-platform images (`linux/amd64` and `linux/arm64`), allowing the project to be run on a wider range of systems, including Apple Silicon Macs.

Configuration and Model Updates:
- The configuration files (`demos_config.toml` and `demos_config_ref.toml`) have been reordered and expanded to include new modules such as `income`, `income_adjustment`, and `normalize_table_dtypes`. Redundant or unused table definitions have been removed for clarity.

Core Codebase Improvements:
- Module imports and defaults have been updated to include `normalize_table_dtypes` and `income`, and to ensure all constants are available.
- A computed column, `hh_head_age`, has been added to `demos/models/aging.py` to efficiently retrieve the age of the household head, supporting new model requirements.