Switch to OEDI Format#75
Open
rajeee wants to merge 132 commits into
New [unique_keys] TOML block with metadata and timeseries key lists (timeseries must be a subset of metadata). Three new schemas ship with it: comstock_oedi_state_and_county, resstock_oedi_new, resstock_oedi_vu. bs_key, up_key, ts_key tuples (plus *_key_cols properties) expose the configured unique-key identity of each table. Aggregate sample counts, load-duration grouping, and timesteps verification now use these tuples via _count_distinct() and multi-column group-by. Single-key schemas still emit byte-identical SQL. Also carries prior session fixes: CACHED sentinel skipped in batch helpers, cost log precision bumped to 2 decimals, and _get_rows_per_building groups on the configured timeseries keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_get_applied_in_subquery now projects all metadata key columns, and _add_applied_in_restrict emits a tuple-IN filter against the correct side (ts_key when timeseries-backed, bs_key otherwise). Single-key schemas still emit the original single-column IN form. _get_restrict_clauses and _normalize_restrict_subquery extended to accept a tuple of columns paired with a multi-width subquery, using sa.tuple_(c1,...).in_(subquery) under the hood. Also fix an outdented return in _get_buildings_by_change that broke the module at import time under strict linting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
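The tuple-IN form described above can be sketched with plain SQLAlchemy Core. The table and column names below are illustrative stand-ins, not the library's actual table objects:

```python
import sqlalchemy as sa

# Illustrative multi-key timeseries table and applied-buildings subquery.
ts = sa.table("ts", sa.column("bldg_id"), sa.column("state"))
applied = sa.table("applied", sa.column("bldg_id"), sa.column("state"))

sub = sa.select(applied.c.bldg_id, applied.c.state)

# Multi-key schemas: tuple-IN against a width-matched subquery.
multi = sa.tuple_(ts.c.bldg_id, ts.c.state).in_(sub)

# Single-key schemas keep the original scalar IN form.
single = ts.c.bldg_id.in_(sa.select(applied.c.bldg_id))

print(str(multi))
print(str(single))
```

Compiling `multi` renders `(ts.bldg_id, ts.state) IN (SELECT …)`, while `single` stays the familiar scalar `ts.bldg_id IN (SELECT …)`.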
get_building_ids and get_buildings_by_locations now return DataFrames projecting all bs_key columns instead of just bldg_id. _get_upgrade_buildings and get_buildings_by_change return list[int] for single-key schemas and list[tuple] for multi-key. _get_full_options_report's array_agg uses row(...) packing when the upgrade key has more than one column. Single-bldg_id default paths keep byte-identical SQL and return shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_results_csv, get_results_csv_full, get_upgrades_csv, and get_upgrades_csv_full now return flat DataFrames. Parquet readbacks that still carry a bldg_id index are reset so the returned shape is consistent regardless of source. Breaking change: callers that relied on df.loc[bldg_id] must now do df[df.bldg_id == bldg_id] or set the index themselves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
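The caller migration can be sketched in pandas with a toy frame (real result frames carry many more columns):

```python
import pandas as pd

# Toy stand-in for a flat results frame.
df = pd.DataFrame({"bldg_id": [1, 2, 2], "kwh": [10.0, 5.0, 7.0]})

# Before: df.loc[bldg_id] relied on bldg_id being the index.
# After: filter on the column...
rows = df[df.bldg_id == 2]

# ...or opt back into index-based access explicitly.
indexed = df.set_index("bldg_id")
```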
Old check compared row counts between timeseries and baseline/upgrade
tables, which produced false positives whenever a bldg_id could
legitimately appear in baseline more than once (e.g. per state).
New flow:
1. Distinct-bldg parity per upgrade between timeseries and baseline/upgrade
tables, using bs_key and ts_key for distinct counting. A dedicated
_get_metadata_distinct_bldg_count replaces the row-counting derivation
from get_success_report.
2. Duplicate detection per table: group by bs_key on baseline, up_key +
upgrade on the upgrade table, and ts_key + timestamp (+ upgrade when
present) on the timeseries table. First 5 offending rows are printed.
3. Existing per-building row-count verification retained, now phrased in
terms of partitions rather than buildings.
_get_ts_report now counts distinct ts_key tuples instead of distinct bldg_id.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
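Step 2's duplicate probe can be sketched in SQLAlchemy Core; the table and key columns here are illustrative, while the real keys come from the configured unique_keys:

```python
import sqlalchemy as sa

# Illustrative baseline table keyed on (bldg_id, state).
bs = sa.table("baseline", sa.column("bldg_id"), sa.column("state"))
key = (bs.c.bldg_id, bs.c.state)

# Group on the configured key and surface the first few offenders.
dupes = (
    sa.select(*key, sa.func.count().label("n"))
    .group_by(*key)
    .having(sa.func.count() > 1)
    .limit(5)
)
print(str(dupes))
```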
Near-duplicate of _get_rows_per_building with no callers in the package. Drop it rather than maintain two versions through the multi-key refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- bs_key / up_key / ts_key reflect configured unique_keys and revert to (bldg_id,) when cleared. - applied_in emits tuple-IN for multi-key schemas and scalar-IN for single-key. - aggregate_timeseries with timestamp_grouping_func emits count(DISTINCT (c1, c2)) for multi-key and count(distinct(c1)) for single-key. - get_building_ids projects all bs_key columns. Also switch _count_distinct to sa.distinct(sa.tuple_(*cols)) so the multi-column form renders as the standard SQL `count(DISTINCT (c1, c2))` rather than `count(distinct(c1, c2))`, which Trino parses as a single-arg function call on a row tuple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
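The rendering difference is easy to see with a bare compile (illustrative columns; default SQLAlchemy dialect rather than Trino's):

```python
import sqlalchemy as sa

t = sa.table("ts", sa.column("bldg_id"), sa.column("state"))

# Generic-function form: a single-arg call on a row tuple, which Trino
# parses as a function named "distinct".
bad = sa.func.count(sa.func.distinct(t.c.bldg_id, t.c.state))

# sa.distinct over sa.tuple_: the standard count(DISTINCT (c1, c2)).
good = sa.func.count(sa.distinct(sa.tuple_(t.c.bldg_id, t.c.state)))

print(str(sa.select(bad)))
print(str(sa.select(good)))
```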
The branch earlier swapped flake8 for ruff with an aggressive ruleset (E, W, F, I, UP, B) that flagged ~280 pre-existing UP/B violations unrelated to unique_keys work. Revert to flake8 with the same config used on main (.flake8, max-line-length = 120) so this PR stays focused on multi-key unique-keys changes. Also wrap two 137-char lines in _get_metadata_distinct_bldg_count so flake8 passes cleanly.
UNLOAD writes query results as multiple parquet files, and row order across chunks is not guaranteed when read back. test_savings_shape_with_timestamp_grouping was intermittently failing because the time column started at different points depending on which parquet chunk pandas read first. Sort df1 and df2 by (geometry_building_type_recs, time) before comparing so the assertion is order-independent.
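A minimal version of the order-independent comparison, using toy frames with the same column names the test sorts on:

```python
import pandas as pd

def normalized(df):
    # UNLOAD chunking gives no row-order guarantee; canonicalize the
    # row order before comparing.
    return df.sort_values(
        ["geometry_building_type_recs", "time"]
    ).reset_index(drop=True)

df1 = pd.DataFrame(
    {"geometry_building_type_recs": ["b", "a"], "time": [2, 1], "kwh": [5.0, 3.0]}
)
df2 = df1.iloc[::-1].reset_index(drop=True)  # same rows, shuffled chunk order

pd.testing.assert_frame_equal(normalized(df1), normalized(df2))
```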
Set the pandas index back on the four CSV-returning functions to preserve backward compatibility with callers that rely on df.loc[bldg_id]. For single-key schemas the index is a single-level Index on bldg_id (byte-identical to main). For multi-key schemas it becomes a pandas MultiIndex on the configured bs_key / up_key tuple. get_results_csv, get_results_csv_full, get_upgrades_csv, and get_upgrades_csv_full all use `set_index(list(self.bs_key))` (or up_key). Full variants reset any existing index from the parquet before re-indexing so the result is consistent regardless of how the parquet was written.
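For a multi-key schema the restored index is a MultiIndex; a toy sketch with bs_key shown as (bldg_id, state) purely for illustration:

```python
import pandas as pd

bs_key = ("bldg_id", "state")  # illustrative configured key
df = pd.DataFrame(
    {"bldg_id": [1, 1, 2], "state": ["CO", "WY", "CO"], "kwh": [3.0, 4.0, 5.0]}
)

indexed = df.set_index(list(bs_key))
# Multi-key callers now address rows by the full key tuple:
value = indexed.loc[(1, "WY"), "kwh"]
```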
Fixes ValueError when querying timeseries data against schemas whose timeseries unique key spans multiple columns (e.g. bldg_id + state). _add_applied_in_restrict now threads a width-matched subquery to the correct side, _split_restrict routes tuple column refs to the ts side, and restrict/avoid accept (col_tuple, subquery | list-of-tuples).
bs_tbl in this branch is a ts-side subquery whose columns match the timeseries unique key, not the metadata key. Looking up bs_key on it raises KeyError for schemas whose metadata key has cols absent from ts (e.g. in.nhgis_tract_gisjoin). Surfaced by a monthly TS query against the comstock_oedi_state_and_county schema.
Clears the deprecated query surface ahead of the 2.0 release so the new bsq.query(...) facade is the single public path for annual, timeseries, and savings aggregations. Internal callers (utility_query, viz_data) are migrated, savings_query is deleted outright, and the savings_shape-compat branches in aggregate_query._query collapse to single canonical paths. Also drops TSQuery.split_enduses (application code should batch) and TSQuery.collapse_ts (use timestamp_grouping_func="year"); SavingsQuery schema is removed.
The else-branch in init_monthly_results referenced run_obj._conn, which does not exist on BuildStockQuery, and used run_obj.params instead of run_obj.run_params for the S3 unload bucket. Both were latent bugs that only fired on cache miss; the prior aggregate_timeseries SQL happened to match cached entries, so the branch never ran in CI. Replace the hand-rolled unload/cache dance with a direct run_obj.query() call, which routes through execute() and handles caching, S3 result lookup, and UNLOAD submission consistently with the rest of the library. Note: test_Viz fixtures may need their reference pkl cache regenerated against the new SQL text; that is a separate fixture-maintenance task.
When a timeseries query asks only for upgrade output (no savings, no baseline) with applied_only=True, the ts_b/ts_u subquery pairing adds unnecessary cost and, more importantly, returns zero rows for runs whose timeseries table lacks upgrade=0 rows (upgrade-only runs). The column expressions for applied_only reference only up_tbl, so we can aggregate ts_u directly with the same shape the upgrade_id="0" branch already uses. Also refresh the viz test reference caches with entries keyed off the new SQL text.
Relocate the pre-snapshot test modules (test_BuildStockQuery.py, test_BuildStockUtility.py, test_query.py, test_Viz.py, and their helpers) into tests/legacy/ so pytest stops collecting them. The new snapshot-driven framework replaces them and the old files are kept for reference.
Introduce a test framework that pins BuildStockQuery.query() generated SQL
and reference result data per curated argument set, plus cross-query
invariant checks that verify internal consistency without hitting Athena.
Framework (tests/test_utility.py, tests/conftest.py,
tests/test_query_snapshots.py):
- Each snapshot entry in tests/query_snapshots/{schema}/*.json carries
one or more equivalent arg-variants, a .sql file (exact generated
SQL), and a .parquet file (reference DataFrame).
- Normalized-whitespace SQL comparison with sqlglot fallback; variant
divergence surfaces as per-variant failure blocks.
- Pytest options --check-data, --update-sql, --update-sql-and-data
control data comparison and (re)generation of reference files.
- Session-scoped fixtures for comstock_oedi_state_and_county and
resstock_oedi_vu schemas against buildstock_sdr on rescore.
Invariants (tests/test_invariants.py):
- annual == ts-year-collapse == sum(ts-monthly) for enduse totals.
- sample_count / units_count consistency across query flavors (resstock
only; comstock counts intentionally skipped pending a library fix).
- baseline - upgrade ~= savings within a single include_savings query.
Seeded with ~30 entries across annual/timeseries/savings/applied_only/
restrict_avoid/mapped_column for both schemas, CO-restricted and limited
to electricity/natural-gas totals to keep Athena usage small.
See tests/query_snapshots/README.md for the workflow.
_add_restrict already accepts annual_only to route restrict columns through the baseline vs timeseries table, but the caller in BuildStockAggregate._query was never passing it, while the adjacent _add_avoid call was. Bring restrict into line with avoid so annual-only queries consistently resolve restrict columns against the baseline table.
On schemas where one TS row maps to multiple metadata rows (e.g.
comstock_oedi_state_and_county, which has one metadata row per
(bldg_id, nhgis_tract_gisjoin, state) but one TS row per (bldg_id,
state)), the TS-grouping branch of BuildStockAggregate._query was
returning sample_count values that undercounted by the per-TS-key
metadata multiplicity — empirically 1.4x-100x across building types.
Count distinct over the full metadata key on the baseline table
(already present in tbljoin) instead of over the TS key on the TS
subquery. For schemas where bs_key == ts_key (resstock_oedi_vu etc.)
the resulting count value is identical; SQL differs because columns
are now referenced via bs_table.
Comstock TS-grouping snapshot data needs regeneration after this
change:
pytest -s -v tests/test_query_snapshots.py --update-sql-and-data
Re-enable the sample_count/units_count check in the comstock invariant
test once snapshots are refreshed.
The library bug that made these undercount on comstock TS-grouping queries was fixed in 25fa3fa. Restore the active count-column comparisons to guard against regression. Now 4/4 invariants pass including the comstock counts.
Only _get_gcol (the group_by path) used to fall back to the prefixed
column name when a bare name didn't resolve. Restrict, avoid, and the
TS side of _split_restrict went straight to _get_column and raised
ValueError on bare names like "state" / "vintage" whenever the
underlying table only exposed the prefixed form ("in.state" etc).
Move the fallback into QueryCore._get_column itself so every caller
that resolves a user-provided string column gets the same behavior.
Simplify _get_gcol to drop its now-redundant except block.
Also trim trailing-newline inconsistency in
comstock_oedi_state_and_county.toml.
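The fallback can be sketched as a small pure function. The real method resolves SQLAlchemy columns on live tables; the dict and names here are stand-ins:

```python
def resolve_column(table_columns, name, prefix="in."):
    """Resolve a user-provided column name, falling back to the
    characteristics-prefixed form (sketch of the behavior above)."""
    if name in table_columns:
        return table_columns[name]
    prefixed = prefix + name
    if prefixed in table_columns:
        return table_columns[prefixed]
    raise ValueError(f"Column {name!r} not found in table")

# Table that only exposes the prefixed form, as described above.
cols = {"in.state": "<state col>", "in.vintage": "<vintage col>"}
```

With this in `_get_column` itself, bare "state" and explicit "in.state" resolve to the same column for every caller (group_by, restrict, avoid).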
Now that _get_column auto-prepends the characteristics prefix, the
snapshot JSONs no longer need per-schema hand-prefixing for restrict
and avoid. Simplify to a uniform convention:
- group_by: write "in.<col>" explicitly (readable, consistent across
schemas).
- restrict / avoid: write bare "state" / "vintage" — the library's
auto-prefix fallback resolves to "in.state" / "in.vintage" when
that's what exists on the table.
Comstock group_by entries gain their "in." prefix; resstock
restrict/avoid entries drop theirs. Stored SQL is unchanged because
the resolved SQLAlchemy column objects are the same.
Now that QueryCore._get_column auto-prepends the characteristics prefix for both group_by and restrict/avoid, snapshot JSON entries can rely on bare column names everywhere. Stored .sql files are unchanged — the fallback resolves bare "state" / "geometry_building_type_recs" / "comstock_building_type" / "vintage" to the same SQLAlchemy column objects that were previously written out explicitly.
Collapse four near-duplicate test functions into two parametrized ones:
- test_annual_equals_ts_year_equals_ts_monthly_sum[resstock_...]
- test_annual_equals_ts_year_equals_ts_monthly_sum[comstock_...]
- test_savings_decomposition[resstock_...]
- test_savings_decomposition[comstock_...]
Each case provides the fixture name, schema directory, per-leg enduse lists (comstock baseline uses the "..kwh" suffix, TS does not), and the group-by column. The bsq fixture is resolved at test time via request.getfixturevalue so the case table stays declarative. Net: -48 lines, same 4 cases, no behavior change (15/15 pass).
Snapshot layout
- Collapse per-schema JSON files into one shared file per flavor at tests/query_snapshots/<flavor>.json. Per-schema differences in column names (electricity, gas, building_type, vintage, mapping_dict, etc.) go through $PLACEHOLDER substitution at load time, with leg-aware resolution for comstock's `..kwh` annual suffix.
- Move artifacts to flat <schema>_sql/ and <schema>_data/ directories; drop the per-schema _CO filename suffix.
Snapshot harness contract
- Rename --update-sql/--update-sql-and-data to --update-snapshot/--overwrite-snapshot. --update-snapshot writes only refactor-safe diffs: cosmetic SQL drift, or real drift where the data still matches/missing/equivalent. Real data divergence requires --overwrite-snapshot.
- Add `data_status='equivalent'` for shape-shifted-but-value-matching data (e.g. a column added/removed without changing shared values).
- Tolerate approx_percentile noise on array-valued columns (quartiles) via per-row scale-aware tolerance.
- Add `_method` dispatch so snapshot entries can target methods other than `query()` (e.g. `get_building_ids`).
SQL generation cleanup
- Strip CAST(col AS VARCHAR) from generated SQL. Add a `typed_literal` helper that coerces comparison values to match the column SQL type at the call site (int for upgrade columns, bool for applicability). Trino can now use parquet stripe statistics for predicate pushdown.
- Push user bs_restrict (e.g. state='CO') into the inner ts ⋈ bs JOIN ON condition so comstock's tract-denormalized metadata is filtered before the timeseries join.
- _split_restrict applies columns living on both bs and ts (state, upgrade) to both sides, not just ts.
- Raise UnsupportedQueryShape from __get_timeseries_bs_up_table's upgrade-pair branch on comstock; the harness treats this as a skip.
API additions
- get_building_ids(applied_in=...) returns building keys filtered to those that applied to all listed upgrades. Uses tuple-IN for comstock's multi-column unique key.
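The CAST-stripping idea can be sketched with SQLAlchemy. `typed_literal` below is a hypothetical stand-in for this PR's helper, which may differ in detail:

```python
import sqlalchemy as sa

def typed_literal(column, value):
    # Coerce the comparison value to the column's Python/SQL type so the
    # column itself is never wrapped in CAST (sketch, not the PR's code).
    return sa.literal(column.type.python_type(value), type_=column.type)

upgrade = sa.column("upgrade", sa.Integer)

# Comparing against a string would otherwise force a CAST around the
# column; coercing the literal keeps the predicate pushdown-friendly.
clause = upgrade == typed_literal(upgrade, "1")
print(str(clause))
```

Because the column stays bare in the predicate, engines like Trino can evaluate it against parquet stripe statistics instead of casting every row.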
Invariant tests - 8 new equality invariants exercising existing snapshot data plus one auxiliary building_ids snapshot trio: group_by sum = overall, CO subset of CO+WY, avoid + avoided = full, MappedColumn aggregates, 15-min raw sums to monthly, savings_only column equality, two-fuel electricity = single-fuel, applied_in intersection. - xfail mark for the known applied_only divergence on the timeseries upgrade-pair flow (TS table includes inapplicable buildings; annual flow filters them via up.applicability=true, TS flow doesn't). Test count: 39 passed, 1 skipped, 2 xfailed.
Previously the notebook used Path.cwd() with a 3-way fallback chain to find the snapshot test cache directory. That worked from the repo root, tests/, or tests/example_notebooks/, but could fail if Jupyter was launched from a deeper or unrelated directory. Use IPython's `_dh[0]` (the notebook's own directory at kernel startup) to resolve the cache as a relative sibling: `../query_snapshots/<schema>_cache`. Falls back to "." if `_dh` is missing (e.g. when the cell is run outside Jupyter), preserving backward compatibility with non-Jupyter execution. Verified end-to-end via `jupyter nbconvert --execute` on annual_resstock_oedi.ipynb — 13 cells executed, all queries returned cached results with $0.00 cost.
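The resolution can be sketched as follows; the schema directory name is illustrative:

```python
from pathlib import Path

# IPython populates "_dh" with the kernel's directory history; index 0
# is the notebook's own directory at startup. Outside Jupyter the name
# is absent and we fall back to ".".
notebook_dir = globals().get("_dh", ["."])[0]
cache_dir = Path(notebook_dir) / ".." / "query_snapshots" / "resstock_oedi_cache"
print(cache_dir)
```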
Generated notebooks now run through nbclient after writing so cells
contain real DataFrame outputs (rich text/html tables + text/plain)
instead of a pre-embedded preview. The notebook IS the runnable artifact
AND the source of truth for current behavior.
Two safety mechanisms keep regen cheap and bounded:
1. `_NO_EXECUTE_METHODS` allowlist — `report.get_success_report`,
`report.check_ts_bs_integrity`, and `agg.get_building_average_kws_at`
are commented out in source. Each fans out internally to many
Athena queries the snapshot cache doesn't cover (count-distinct
per upgrade, per-pair change tracking, the 3.2 TB landmine
documented in CLAUDE.md). Determined by reading method source —
only true singleton-execute methods run live.
2. `_has_unroundtrippable_placeholder` — auto-comments any cell whose
args contain SA object reprs (`<sqlalchemy....Label at 0x...>`,
`MappedColumn(bsq=<...object at 0x...>)`, or our own `<Class ...>`
fallback). Those can't run as written; the user must construct
the SA object inline.
`allow_errors=True` on the nbclient executor: a single bad cell embeds
its error in the notebook instead of aborting execution. This catches
pre-existing interactive-mode bugs in `bsq.get_results_csv` etc. that
the snapshot tests miss (they run with get_query_only=True, never
exercising the post-query DataFrame access).
End state: 121 cells executed cleanly, 8 commented (unsafe), 10
commented (placeholder), 6 cells with embedded errors documenting
real bugs to fix separately.
Removes unused imports (`os` in main.py and helpers.py, `sqltypes` in main.py) and a stale `ucol` local in aggregate_query.py; also adds a missing blank line before a nested def and wraps a 124-char comment line. The CI Lint step was failing on these six pyflakes/pycodestyle violations.
The bs_per_bldg subquery (introduced to collapse ComStock's tract fan-out before joining ts_aggr) was wrapping bs-side group_by columns in arbitrary() and grouping only on (bldg_id, state). For ComStock — whose metadata is partitioned at (bldg_id, tract, state) granularity with weight divided across tract rows — a building straddling counties had its full weight attributed to whichever county arbitrary() picked, silently dropping the tract-fractional disaggregation.
The same issue silently affected ResStock's utility_ts_by_eiaid: eiaid_weights fans out per-county into multiple utilities, so arbitrary(eiaid) collapsed the weighted apportionment.
Fix: project bs-side group_by columns as bare references and add their underlying expressions to bs_per_bldg's GROUP BY. Pure-bs enduses (sqft, vintage) and extra_bs_cols stay under arbitrary() — they're true per-bldg constants.
Adds three ComStock-only timeseries snapshots (year-collapse: by_county, by_tract, overall_co) plus an invariant cross-checking sum(by-county) == sum(by-tract) == overall. Replaces utility_ts_by_eiaid's eiaid=4110 (no CO data → empty) with eiaid=15466 (Xcel CO, ~9.3k samples) so the snapshot exercises the disaggregation path with real data.
SqlCache now appends each get/put hash to .cache_usage_log inside the cache folder so we can tell which entries a test session actually exercised. tests/cleanup_stale_caches.py removes cache files whose hashes are referenced in neither the snapshot JSONs nor the usage log; tests/refresh_caches.py wraps the clear → run-tests → cleanup flow. Also includes the stale entries that the first cleanup run identified and dropped, freeing the resstock_oedi_cache + comstock_oedi_cache directories of orphans left over from previous SQL refactors.
Timestamps and a few result-data fields refreshed from the most recent test session that exercised every notebook against the pruned cache.
Extends the test suite to cover the county-aggregated ComStock metadata table (`_md_agg_by_state_and_county_parquet`) where buildings within a county are pre-collapsed into one weight-correct row. The agg schema is wired into every schema-aware list/dispatcher (fixtures, snapshot and invariant parametrize lists, placeholder resolver, cleanup script, notebook builder, backfill costs) and the snapshot harness's `_patch_hash` now inserts a missing schema key in addition to overwriting an existing one. `--update-snapshot` populated the new `comstock_oedi_agg_cache/` with 76 sidecar triples covering 66 JSON entries.
Walks every snapshot JSON entry across six weight-aggregating flavors (annual, timeseries, savings, applied_only, restrict_avoid, invariants_three_way), runs the query on both ComStock schemas, and asserts equal DataFrames after dropping `sample_count` (the only column that differs because the agg table has fewer rows). Filters out nondeterministic entries and aggregations that don't survive county pre-collapse (mean/max/min/quartiles/arbitrary). A separate test handles `get_building_ids`: the two schemas return different partition columns (tract vs county), so the comparison projects to (bldg_id, state) and dedupes — the resulting building set is identical across schemas. Both tests run from the snapshot caches in under 6 seconds with no Athena spend.
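The drop-then-compare step can be sketched in pandas with toy frames; the column set is illustrative, and `sample_count` is the one column the text says legitimately differs:

```python
import pandas as pd

def assert_equal_ignoring(df_a, df_b, drop=("sample_count",)):
    # Compare the two schema variants after dropping the one column
    # whose values legitimately differ between them.
    a = df_a.drop(columns=list(drop)).reset_index(drop=True)
    b = df_b.drop(columns=list(drop)).reset_index(drop=True)
    pd.testing.assert_frame_equal(a, b)

tract = pd.DataFrame({"state": ["CO"], "kwh": [9.0], "sample_count": [12]})
county = pd.DataFrame({"state": ["CO"], "kwh": [9.0], "sample_count": [4]})
assert_equal_ignoring(tract, county)  # passes despite differing counts
```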
Adds the three-way invariant on the applied_in=[1,2] code path that combines flows the existing tests covered separately but never together: annual ≡ ts-year-collapse ≡ sum(ts-monthly), now exercised on both the categorical comstock_building_type axis and the county partition-key axis whose arbitrary()-collapse silently broke before commit 182ff21. Three new snapshot entries (annual / ts_year / ts_monthly grouped by county, applied_only=true, applied_in=[1,2], CO) populate the cache for both ComStock schemas, gated to comstock_oedi/comstock_oedi_agg via the JSON `schemas` allowlist since resstock has no county column.
`Query.applied_in` only expressed the all-of-listed-upgrades intersection
semantic. Real callers want any/all/not combinations that don't compose
cleanly as more flags. Drop the field and add two helpers:
- `bsq.get_applied_buildings(any_of, all_of, get_query_only)` — returns
buildings (DataFrame or SQL) matching the predicate. Snapshot-testable.
- `bsq.get_applied_buildings_filter(any_of, all_of)` — returns a
`(cols, subquery)` tuple to drop into `restrict=[...]` or `avoid=[...]`.
Empty lists return None so callers can pass it through unconditionally.
Both route through a shared private `_build_applied_subquery` that emits:
- all_of only: `WHERE upgrade IN ... GROUP BY keys HAVING count = N`
(byte-stable with the prior applied_in shape)
- any_of only: `WHERE upgrade IN ... GROUP BY keys` (no HAVING)
- both: AND'd CASE-filtered counts
0 (baseline) is rejected in either list — applicability is meaningful
only for upgrades.
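The all_of and any_of shapes can be sketched in SQLAlchemy Core; the single-column key and table names are illustrative stand-ins for the configured keys:

```python
import sqlalchemy as sa

up = sa.table("upgrade_tbl", sa.column("bldg_id"), sa.column("upgrade"))

def applied_all_of(upgrades):
    # all_of: WHERE upgrade IN ... GROUP BY keys HAVING count = N
    return (
        sa.select(up.c.bldg_id)
        .where(up.c.upgrade.in_(upgrades))
        .group_by(up.c.bldg_id)
        .having(sa.func.count() == len(upgrades))
    )

def applied_any_of(upgrades):
    # any_of: the same shape without the HAVING clause.
    return (
        sa.select(up.c.bldg_id)
        .where(up.c.upgrade.in_(upgrades))
        .group_by(up.c.bldg_id)
    )

print(str(applied_all_of([1, 2])))
```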
`Query.applied_only` stays untouched. It preserves two annual-path
optimizations the WHERE-side helper can't: the bs⋈up INNER vs LEFT OUTER
JOIN switch, and the bare `up.col` projection vs the
`CASE WHEN applied THEN up.col ELSE bs.col END` form.
Internal callers that previously synthesized `applied_in=[upgrade_id]` on
the TS path (the load-bearing inapplicable-buildings filter at
aggregate_query.py:622-643) now build the same subquery directly via
`_build_applied_subquery(all_of=[upgrade_id], key_kind="timeseries")`.
Behavior is unchanged.
`_split_restrict` learned to handle md-side composite-key tuple restricts
— previously only the TS side did, since the only tuple-keyed restricts
came from the (TS-side) applied_in subquery.
Snapshot JSON entries that used `applied_in` are migrated to a new
`{"_applied_filter": {"all_of": [...]}}` sentinel that the dispatcher in
`tests/test_utility.py:_dispatch_method` resolves into a live
`get_applied_buildings_filter` call before invocation. Most migrated
entries produce byte-stable SQL when the filter is appended after the
existing restrict items (matching the old append order); the four TS
entries with explicit applied_in=[1,2] now produce a redundant double
applied filter (auto-synthesized for upgrade_id + user-supplied for [1,2])
which Trino merges at planning. Those four hashes are blanked with a note
— data check confirms equivalence; user runs --update-snapshot to backfill.
Also picks up the user's separate notebook-builder plan in this commit
since the changes intermixed in shared files: nbclient strict mode
(`allow_errors=True` removed, failure surfaces to pytest), shared
`expand_rate_map_flat` helper, and the `_was_wrapped` cache-poisoning
guard for list/dict/Series-returning methods.
Three real cell failures in tests/example_notebooks/utility_resstock_oedi.ipynb were silently embedded in cell outputs because nbclient ran with allow_errors=True (the strict-mode flip and the test_utility/test_query_snapshots plumbing for that — wrap-marker, cache-poisoning guards, expand_rate_map_flat helper — landed in commit a44bed7). This change finishes the loop:
- buildstock_query/utility_query.py: get_locations_by_eiaids now reads the result via res.iloc[:, 0] instead of res[map_eiaid_column]. Athena UNLOAD names projections positionally when the SELECT-DISTINCT column has no alias, so the parquet column is "value" rather than "gisjoin", and the named lookup raised KeyError on every live invocation.
- tests/diagnose_wrapped_cache.py: read-only diagnostic that walks every <hash>.parquet/.sql pair and reports any whose column count contradicts the SELECT-list count in the .sql sidecar. Surfaces the wrap-then-writeback cache poisoning whenever it crops up; runs in seconds.
- Deleted comstock_oedi_agg_cache/629ce88f...{parquet,sql,json}: orphaned by the applied_filter SQL refactor; the new SQL hash already has its own triple on disk.
- Re-executed utility_resstock_oedi.ipynb and report_comstock_oedi_agg.ipynb under the now-strict nbclient. All cells succeed.
…nts.json
The invariants suite was issuing many queries that no snapshot JSON
owned. Without ownership, those cache parquets had no SQL drift check
and couldn't be repopulated via --update-snapshot — and refresh_caches
was hard to reason about because hashes touched by invariants weren't
declared anywhere.
Solution: each invariant call site adds a sibling record_query(bsq,
args) line. The recorder writes a JSONL log per session;
normalize_invariant_snapshot.py merges into from_invariants.json with
blank sql_hash for new entries (user fills via --update-snapshot)
and prunes entries no longer recorded.
The args dict you pass IS what gets serialized — no placeholder
reverse-encoding. One entry per (schema, args) — each schema gets
its own row because resolved column names differ.
Markers for non-JSON-encodable shapes:
- {"_applied_filter": {"all_of"|"any_of": [...]}} for the live SA
Subquery returned by get_applied_buildings_filter().
- {"_calc_column": {"name", "expr", "table"}} for SA Label produced
by get_calculated_column().
- {"_mapped_column": {"name", "key_column", "mapping_dict"}} for
MappedColumn(...).
The new _resolve_calc_and_mapped_columns helper rebuilds the live SA
objects when the snapshot harness loads from_invariants.json.
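The marker resolution can be sketched as a small dispatcher. The marker names come from this commit; the builder callables are illustrative stand-ins for the live constructors:

```python
def resolve_markers(args, build_applied, build_mapped):
    """Rebuild live objects from JSON sentinel markers (sketch of the
    _resolve_calc_and_mapped_columns idea; builders are stand-ins)."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, dict) and "_applied_filter" in value:
            resolved[key] = build_applied(**value["_applied_filter"])
        elif isinstance(value, dict) and "_mapped_column" in value:
            resolved[key] = build_mapped(**value["_mapped_column"])
        else:
            resolved[key] = value  # plain JSON value passes through
    return resolved

out = resolve_markers(
    {"restrict": {"_applied_filter": {"all_of": [1, 2]}}, "agg_func": "sum"},
    build_applied=lambda all_of=None, any_of=None: ("filter", all_of, any_of),
    build_mapped=lambda **kw: ("mapped", kw),
)
```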
Other changes:
- tests/example_notebook_builder.py: render the three markers as
inline construction calls so generated notebooks are runnable
Python; import MappedColumn at the top of every notebook.
- tests/test_utility.py: drop --update-snapshot/--overwrite-snapshot
as triggers for notebook regeneration. Trigger is now strictly
"missing notebook" or "data writethrough fired" — avoids spurious
regen + working-tree churn on no-op --update-snapshot runs.
- tests/refresh_caches.py: insert a normalize step (with --prune
when invariants ran fully) between the test runs and cleanup.
- tests/query_snapshots/building_ids.json: three new multi-state
entries owning the cache hashes that test_applied_buildings_intersection
previously hit without coverage.
- tests/find_uncovered_cache.py: read-only diagnostic that lists
cache hashes touched by tests (.cache_usage_log) but not owned
by any JSON sql_hash. Useful for future audits.
- tests/query_snapshots/from_invariants.json: 181 auto-recorded
entries from the wrapped invariant call sites.
Result of running tests/refresh_caches.py end-to-end:
- cleanup_stale_caches.py --delete removed cache parquet/sql/json triples no longer referenced by any JSON snapshot or recorded by the test suite.
- Example notebooks under tests/example_notebooks/ regenerated to reflect the current cache state.
- example_usage/ notebooks re-executed against the refreshed cache.
…g_ids
The TS query path replaced bs_table in the outer FROM with bs_per_bldg (per the recent pre-aggregation refactor), but `_add_avoid` still resolved bs columns to self.bs_table — producing a Cartesian comma-join against the outer FROM and silently dropping the NOT-IN predicate. The same shape hits ts-side avoid clauses.
Fix by splitting `params.avoid` the way `_split_restrict` does for restrict: bs-side avoids fold into bs_per_bldg's WHERE, ts-side avoids fold into ts_flat's WHERE, and only join_list-side residue stays at the outer level.
Also add `avoid` to `get_building_ids` (mirrors the existing `restrict` arg, additive default). Lets callers select the complement of an applied-buildings filter without going through `bsq.query`.
…metic
Three new tests in tests/test_invariants.py validate the
get_applied_buildings_filter(any_of=, all_of=) API end-to-end:
- test_applied_buildings_union: any_of=[a,b] building set equals union
of all_of=[a] ∪ all_of=[b]; meaningfulness gate asserts union ≠
intersection (otherwise the identity is trivially true and the test
is vacuous).
- test_applied_buildings_avoid_complement: avoid=[applied_filter]
returns universe \ applied_set; parametrized over filter_kind ∈
{all_of, any_of} so both filter shapes go through the avoid path.
- test_applied_buildings_set_identities_via_query: aggregated route
via bsq.query() over a curated 8-bldg universe spanning four regions
(only_a, only_b, both, neither). Runs 9 filters × 5 group_by = 45
queries per (schema × annual_only) cell, asserting (a) sum(units_count)
is invariant to group_by choice under any one filter, and (b) 14 set-
arithmetic identities including inclusion-exclusion, region
decompositions, and "restrict + matching avoid == universe" pairings.
This last test caught the TS-path avoid bug fixed in 22ef15b.
Helpers:
- _pick_meaningful_upgrade_pair: auto-discovers (a, b) such that all
four region splits are non-empty, cached on bsq instance. Avoids
brittle hardcoded upgrade ids that may apply identically per schema.
- _curate_applied_universe: picks 8 bldg_ids spanning all four regions
for bounded TS query cost. Fails loudly if any region is empty.
Eight new tests in tests/test_invariants.py systematically attack code paths that an adversary would target. All bound to state=CO and either a curated 8-bldg universe or first-N applied bldg_ids to keep Athena cost in cents.
- test_applied_filter_subquery_equals_id_list: IN-subquery encoding equals materialized IN-list encoding for the same effective set, plus idempotency (`restrict=[f, f] == [f]`) and cardinality identity (`n_applied + n_avoided == n_universe`). Composite-key schemas materialize ALL md-key tuples for the chosen bldg_ids so no building is partially sliced (caught a real test-design bug where first-N tuples missed county slices on comstock_agg).
- test_applied_filter_multi_state_composition: per-state composite-key tuples are disjoint across states (no state leakage), and per-state units_count totals sum to the multi-state total under the same applied filter.
- test_applied_filter_empty_set_degeneracy: when the applied set is provably empty (all_of=[1..12]), restrict yields 0 rows and avoid yields all rows. Pins SQL `IN ()` semantics.
- test_applied_filter_order_and_de_morgan: restrict=[A, B] == restrict=[B, A] (order independence). Avoid de Morgan: `avoid=[any_of=[a,b]]` == `avoid=[all_of=[a], all_of=[b]]`.
- test_applied_only_equals_explicit_all_of: `applied_only=True` for upgrade u equals `applied_only=False, restrict=[applied_filter(all_of=[u])]`. Pins the equivalence between internal injection (aggregate_query.py:645) and the public API. Plus: |savings| ≈ 0 on inapplicable buildings (`avoid=[applied_filter]` ⇒ baseline == upgrade).
- test_applied_filter_ternary_union_and_singleton: `any_of=[a,b,c]` == ⋃ singletons (catches HAVING-count bugs at N>2). `all_of=[a]` == `any_of=[a]` (singleton edge case).
- test_applied_filter_avoid_id_list_equals_avoid_subquery: NOT-IN-subquery == NOT-IN-list within the curated universe. Catches NOT-IN semantic drift.
- test_applied_filter_calc_column_composition: calc(elec - gas) under an applied filter equals the manual elec - gas decomposition. Pins calc-column rebinding through the ClauseAdapter path under restrict=[applied_filter].
All 24 cases (8 tests × 3 schemas) PASS after fixing two test-design issues caught during validation: (1) IN-list length exceeded Athena's 262144-char limit for unbounded materialized lists, fixed by the curated universe plus per-bldg full-tuple materialization; (2) sorted first-N sliced composite-key tuples mid-building, fixed by selecting unique bldg_ids first and then all their tuples.
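The set-algebra identities these tests pin can be replayed on plain Python sets, assuming `all_of` means intersection and `any_of` means union over per-upgrade applied sets (toy data, not the real filter API):

```python
universe = set(range(1, 13))
applied = {  # upgrade -> buildings where it applies (toy data)
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5, 6},
    "c": {6, 7, 8},
}

def all_of(ups):  # buildings where every listed upgrade applies
    s = set(universe)
    for u in ups:
        s &= applied[u]
    return s

def any_of(ups):  # buildings where at least one listed upgrade applies
    s = set()
    for u in ups:
        s |= applied[u]
    return s

# idempotency: applying the same filter twice changes nothing
assert all_of(["a", "a"]) == all_of(["a"])
# cardinality identity: n_applied + n_avoided == n_universe
f = any_of(["a", "b"])
assert len(f) + len(universe - f) == len(universe)
# ternary union equals the union of singletons (N > 2 case)
assert any_of(["a", "b", "c"]) == any_of(["a"]) | any_of(["b"]) | any_of(["c"])
# singleton edge case: all_of([u]) == any_of([u])
assert all_of(["c"]) == any_of(["c"])
# avoid de Morgan: avoid(any_of[a,b]) == avoid(all_of[a]) ∩ avoid(all_of[b])
assert universe - any_of(["a", "b"]) == (universe - all_of(["a"])) & (universe - all_of(["b"]))
```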
Three new tests in tests/test_invariants.py target the user's worry about miscounting buildings across grouping × baseline-join × TS code paths.
- test_count_partition_under_group_by: per-group sum(units_count) and sum(sample_count) over four group_by axes (bldg_id, county, bldg_type, county+bldg_type) must equal the ungrouped totals. sample_count is compared at integer precision (no rtol drift cover). Catches tract fan-out, GROUP BY NULL drops, and arbitrary() collapse bugs that don't show up in single-axis tests.
- test_count_partition_under_applied_filter: the same partition invariant composed with an applied filter, bounded to the curated 8-bldg universe. Catches grouping bugs that only surface under the applied-subquery composition.
- test_count_integrity_annual_vs_ts_year_collapse: integer-exact sample_count and float-close units_count agreement between the annual baseline (`upgrade_id=0`) and the TS year-collapse baseline (`annual_only=False, timestamp_grouping_func='year'`), across multiple group_by axes and with/without an applied filter. Pins the bs↔ts join wiring through bs_per_bldg pre-aggregation: any drop or duplication on the upgrade=0 side flips the integer sample_count immediately.
All 9 cases (3 tests × 3 schemas) PASS. Snapshot cache refreshed where these queries land in new shapes; existing entries unchanged.
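The partition invariant itself is easy to state in pandas: grouped sums over any axis must reproduce the ungrouped totals. A toy sketch with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({
    "bldg_id": [1, 1, 2, 3, 3, 3],          # tract slices fan a building out
    "county": ["X", "X", "X", "Y", "Y", "Y"],
    "bldg_type": ["sfd", "sfd", "mf", "sfd", "sfd", "sfd"],
    "units_count": [10.0, 5.0, 8.0, 2.0, 2.0, 2.0],
})

total_units = df["units_count"].sum()
for axes in (["bldg_id"], ["county"], ["bldg_type"], ["county", "bldg_type"]):
    grouped = df.groupby(axes)["units_count"].sum()
    assert abs(grouped.sum() - total_units) < 1e-9   # float-close partition
    assert df.groupby(axes).size().sum() == len(df)  # integer-exact row count
```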
Three new tests in tests/test_invariants.py compose the applied-buildings filter with code paths that the existing single-axis tests don't cover.
- test_applied_filter_savings_decomposition_per_row: under `include_baseline=include_upgrade=include_savings=True` and an applied filter, the row-level identity `b - u ≈ s` must hold for every row (not just per-group totals). Catches savings-column wiring bugs that resolve against a different bs/up subquery pair than the baseline/upgrade columns.
- test_applied_filter_mean_sum_consistency: pins consistency between `agg_func='mean'` and the default-sum query under an applied filter. Checks that sample_count agrees integer-exact between the two and that `sum / (mean * n)` (the implied per-bldg average weight) is positive, finite, and within a plausible weight range. Catches the mean branch silently swallowing or doubling weight relative to sum.
- test_applied_filter_15min_sums_to_monthly: 15-min raw aggregation rolled up to monthly in pandas must equal a direct `timestamp_grouping_func='month'` query under the same applied filter. Pins the inner ts_flat filter composition through the monthly aggregation pivot.
Bounded to the curated 8-bldg universe (state=CO) so the TS scan is ~kilobytes per query. All 9 cases (3 tests × 3 schemas) PASS. Caught two test-design bugs during validation: (1) `agg_func='mean'` suffixes the column with `__mean` but my test looked for the unsuffixed name; (2) the result frame names the timestamp column after the schema's timestamp_column_name, not the `"time"` group_by alias.
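The 15-min-to-monthly equivalence reduces to this pandas identity (synthetic data; `resample('MS')` stands in for the direct `timestamp_grouping_func='month'` path):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", "2018-03-01", freq="15min", inclusive="left")
rng = np.random.default_rng(0)
ts = pd.DataFrame({"time": idx, "kwh": rng.random(len(idx))})

# path 1: roll raw 15-min rows up to month in pandas
monthly_from_raw = ts.groupby(ts["time"].dt.to_period("M"))["kwh"].sum()
# path 2: a direct monthly grouping (stand-in for the 'month' query shape)
direct_monthly = ts.set_index("time")["kwh"].resample("MS").sum()

assert np.allclose(monthly_from_raw.to_numpy(), direct_monthly.to_numpy())
```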
Targets the user's specific worry: counts may drift when grouping by county across multiple states, because that grain stresses both the state-partition wiring and the bs_per_bldg → ts_aggr join the most. Four new tests in tests/test_invariants.py:
- test_multistate_county_partition: under a multi-state restrict ([CO, WY] for resstock, [CO, NM] for comstock), the (state, county)-grouped totals must equal the state-grouped totals, which must equal the ungrouped total. Plus: state-grouped multi-state rows match per-state-restricted single queries (units_count float-close, sample_count integer-exact). Catches state-boundary leakage and cross-state county-name collisions silently merging.
- test_multistate_county_x_bldg_type_decomposition: under a multi-state restrict, the (state, county, bldg_type)-grouped sums roll up to the (state, county)-grouped sums when summed over bldg_type. Per (state, county): integer sample_count and float-close units_count. Catches multi-key group_by losing rows at the tightest aggregation grain.
- test_annual_vs_ts_year_at_county_grain_multistate: the smoking-gun test for the user's worry. The annual baseline and the TS year-collapse baseline must agree per (state, county) at integer sample_count and float-close units_count. Bounded to a curated cross-state 16-bldg universe so the TS scan is small. Pins the bs_per_bldg pre-aggregation × state-partition × county group_by interaction in one query pair.
- test_multistate_savings_shape_at_county_grain: maximum combinatorial stress: multi-state restrict + group_by=[state, county] + include_baseline + include_upgrade + include_savings + applied_only on the upgrade leg. Per row: `b - u ≈ s` with units_count-scaled tolerance, plus positivity sanity checks (sample_count > 0, units_count > 0).
Helpers:
- _county_col_for_schema: per-schema county column choice (in.county_name / county / in.nhgis_tract_gisjoin).
- _county_result_col: the result-frame name after _simple_label strips the in./out. prefix (caught a real test bug where I indexed result frames with the dotted column name).
All 12 cases (4 tests × 3 schemas) PASS. The TS-table queries are bounded to ~16 curated bldg_ids in the multi-state restrict so each hits cache in seconds.
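The county-grain rollup invariant, sketched on a toy frame with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CO", "CO", "CO", "WY", "WY"],
    "county": ["Denver", "Denver", "Larimer", "Albany", "Albany"],
    "units_count": [3.0, 4.0, 5.0, 6.0, 7.0],
})
by_state_county = df.groupby(["state", "county"])["units_count"].sum()
by_state = df.groupby("state")["units_count"].sum()

# (state, county) totals roll up to state totals, which roll up to the total
assert (by_state_county.groupby(level="state").sum() - by_state).abs().max() < 1e-9
assert abs(by_state.sum() - df["units_count"].sum()) < 1e-9
```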
The inner ts_flat subquery aliases each per-row TS scalar so the next
layer can split it per-upgrade with FILTER. The old `_v__` prefix was
opaque; `ts__` matches the existing `bs__`/`up__` prefix family and
makes the table-of-origin explicit.
Source change is 5 sites in aggregate_query.py plus 2 comments in
test_schema_unique_keys.py — purely an internal alias rename, no
semantic change. Cached query results are byte-identical, so instead
of re-running 271 affected Athena queries (~3 h, ~$1.10), the rename
was applied directly to the on-disk cache:
- rename_alias.py rewrites cached SQL, recomputes sha256 hashes,
renames the .sql/.parquet/.json triples, and updates sql_hash
entries in the snapshot manifests.
- 271 cache triples renamed across 3 schema caches.
- 137 sql_hash entries updated across 6 manifests.
- Fixed-point check: re-running the script reports 0 pending work.
query_stats.py is also added — a general-purpose summary script that
walks *_cache/*.json and reports per-schema query count, bytes
scanned, $ cost @ $5/TB, and wall-clock time.
ComStock queries averaged ~46 s vs ResStock's ~3.7 s in the snapshot suite, but the asymmetry concentrates on metadata-only queries (~42 s ComStock vs ~2.9 s ResStock). Two root causes were identified and quantified through a read-only probe script:
1. Small-files / over-partitioning. The ComStock `_md_*_by_state_and_county_parquet` tables hold 194,308 small parquet files across 3,133 (state, county) partitions; ResStock's `_metadata` has 17 files unpartitioned. Even with a state filter that prunes to one state, ComStock pays ~12 s planning + ~50 s of per-file footer reads.
2. Missing predicate propagation in `buildstock-query` SQL generation. When the outer query has `WHERE state='CO'` but the inner `IN (SELECT ... HAVING ...)` subquery does not, Athena scans all 3,133 partitions on the inner side. Pushing the same state filter into the inner subquery cuts a 58 s query to 5 s — an 11.6× speedup from a single duplicated predicate.
Two pre-aggregated alternative tables already published by the data pipeline (`_md_agg_by_state_parquet`, `_md_agg_national_parquet`) go unused by the framework today; routing eligible queries to them would add another ~3× on top of fix #2.
Adds:
- tests/query_snapshots/investigate_partitions.py — read-only probe script with cost guardrails; results cached so re-runs are free.
- tests/query_snapshots/_partition_probe_cache/ — content-addressed probe cache (matches the existing schema-cache pattern).
- tests/query_snapshots/partition_investigation.json — consolidated dump of all probe results.
- tests/query_snapshots/INVESTIGATION_partition_overhead.md — full writeup of methodology, findings, and recommended priority order.
When the framework emits `IN (SELECT ... HAVING ...)` against the metadata table (e.g. for applied-buildings filters from `get_applied_buildings_filter()` or `applied_only=True` with savings), Athena does not propagate the outer query's `WHERE state='CO'` into the inner subquery. On ComStock metadata — partitioned (state, county) × ~62 upgrades = 194,308 small parquet files across 3,133 partitions — that left the inner side enumerating every partition, dominating wall-clock with ~12 s of planning + ~50 s of footer reads.
Fix: in `_get_restrict_clauses`, pre-scan restrict for safe single-column predicates against bs_table columns (using the existing `_get_column` resolution so user-supplied `state` maps to ResStock's `in.state` and ComStock's bare `state`). For any subquery-valued restrict entry, inject those predicates into the subquery's WHERE — but only when the column is both projected by the subquery and part of the outer IN-clause column tuple. This is semantics-preserving by construction: the outer IN-tuple match logically implies the same predicate on the inner side.
Verified end-to-end on shape C against all four ComStock metadata table variants (`investigate_partitions.py` re-run with new shape C2):

Table layout                          Pre-fix (C)  Post-fix (C2)  Speedup
_md_by_state_and_county_parquet           59.84 s         4.14 s    14.4×
_md_agg_by_state_and_county_parquet       58.07 s         4.38 s    13.3×
_md_agg_by_state_parquet                   3.16 s         1.76 s     1.8×
_md_agg_national_parquet                   2.47 s         2.00 s     1.2×

44 ComStock snapshot variants drifted (one extra `AND bs.<col>=<val>` in the inner WHERE). Zero ResStock drift. Snapshots regenerated via --update-snapshot — every entry showed the expected speedup, with typical -90% bytes scanned and -90% wall-clock per query. Total regen spend: < $0.01.
Updates `INVESTIGATION_partition_overhead.md` to mark Fix #1 as shipped with the measured speedups.
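The semantics-preservation argument can be checked on any SQL engine. A sqlite sketch with a toy `md` table: the tuple-IN match already pins the inner row's state to the outer row's, so duplicating the predicate inside the subquery cannot change the result, only the work done.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE md (state TEXT, bldg_id INT, upgrade INT);
INSERT INTO md VALUES
 ('CO',1,1),('CO',1,2),('CO',2,1),
 ('WY',3,1),('WY',3,2),('WY',1,9);  -- bldg 1 also exists in WY
""")

# shape before the fix: the inner subquery scans every state partition
q_before = """
SELECT state, bldg_id FROM md
WHERE state = 'CO'
  AND (state, bldg_id) IN (SELECT state, bldg_id FROM md
                           GROUP BY state, bldg_id HAVING count(*) >= 2)
ORDER BY bldg_id, upgrade"""

# shape after the fix: the outer predicate is duplicated inside the subquery
q_after = """
SELECT state, bldg_id FROM md
WHERE state = 'CO'
  AND (state, bldg_id) IN (SELECT state, bldg_id FROM md WHERE state = 'CO'
                           GROUP BY state, bldg_id HAVING count(*) >= 2)
ORDER BY bldg_id, upgrade"""

assert con.execute(q_before).fetchall() == con.execute(q_after).fetchall()
```

Note the composite key is what makes this safe: with a bare `bldg_id IN (...)` and no state column in the IN tuple, filtering the inner side could change `HAVING` counts for buildings that span states, which is exactly why the fix only injects predicates on columns that are part of the outer IN-clause tuple.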
User asked whether deduplicating both TS and MD at query time (arbitrary() on TS, sum(weight) on MD, single-key bldg_id join) would beat the routing-to-state-agg-table approach planned for Fix #2 of the partition overhead investigation. Built a 4-variant benchmark against the ComStock TS-monthly baseline, restrict to CO+NM, no group_by:

V1  primary + bs_per_bldg (today's shape)   14.19 s
V2  alt + no bs_per_bldg (routing only)      7.28 s  <- baseline
V3  primary + dedup (dedup only)            22.24 s  (worse than V1)
V4  alt + dedup (routing + dedup)            5.78 s  (-21% vs V2)

All four return total_kwh = 21,434,645,727 — identical to 0 ppm. arbitrary() on cross-state duplicates is exact because value columns are bit-identical across the duplicates per the dedup invariant.
Headline takeaways:
- Routing alone (V2) gets ~2× over V1, with ALL the savings in planning time (7.2 s -> 318 ms). The small-files pathology that motivates this entire investigation disappears completely on the 62-file alt table.
- Dedup on the slow primary table (V3) is worse than doing nothing. The TS-side arbitrary() shuffle costs more than it saves when the rest of the query already pays the planning tax.
- Dedup on the alt table (V4) wins another 21% over V2, purely on engine time (5.54 s vs 7.04 s). Real but modest.
Decision: ship routing only (Pieces A+B+C in the plan). Dedup adds non-trivial query-shape complexity (CTE-style WITH clauses, arbitrary() wrappers on every value column, single-key joins, a separate eligibility check) for a 21% engine-time win on top of routing's 2× planning win. Defer dedup until concrete production workloads sit at the V2 baseline regularly.
Adds benchmark_dedup.py + dedup_benchmark.json + cached probe receipts as a reproducible record. Updates INVESTIGATION_partition_overhead.md with the benchmark findings so future readers can revisit the decision.
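Why arbitrary()-style dedup is exact here: when duplicate rows carry bit-identical values, picking any one row per key loses nothing. A pandas stand-in using `first()` on toy data:

```python
import pandas as pd

# cross-state partitioning lands a building's TS rows in two partitions;
# per the dedup invariant the value columns are identical across duplicates
ts = pd.DataFrame({
    "state": ["CO", "NM", "CO", "NM"],   # same rows, two partitions
    "bldg_id": [1, 1, 2, 2],
    "time": pd.to_datetime(["2018-01-01"] * 4),
    "kwh": [10.0, 10.0, 7.0, 7.0],
})
# arbitrary()-style dedup: take any one row per (bldg_id, time) group
dedup = ts.groupby(["bldg_id", "time"], as_index=False)["kwh"].first()
assert dedup["kwh"].sum() == 17.0  # exact, no double counting
```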
Routing (Pieces A+B from the partition-overhead investigation):
- Schema model gains TableSuffix.annual_and_metadata_state_agg and UniqueKeys.metadata_state_agg (both optional). The comstock unified TOML declares the alt _md_agg_national_parquet table.
- query_core._initialize_tables loads the alt table; the _routing_context context manager swaps bs_table/md_table/md_key/sample_wt for one _query call. _get_unique_keys auto-routes to metadata_state_agg keys via table identity.
- main._pick_metadata_table selects "primary" or "state_agg" from group_by/restrict eligibility. Annual queries route; the TS flow stays on primary (the alt table's in.state vs the primary's state column-name asymmetry blocks cross-name joins).
- _get_restrict_clauses/_inject_propagated accept bs_table=, and restrict candidate_tables to (bs_table,) when routed so 'state' doesn't bind to the TS partition column.
model_count column (Piece C):
- New aggregate column count(distinct bldg_id) added alongside the metadata-row count at every emission site (8 in aggregate_query.py, 1 in main.py:get_distinct_count). Optimization: emits count(*) when the post-join result is bldg-unique per outer group (~22% cheaper than count distinct on Athena, which doesn't auto-rewrite).
Rename sample_count -> metadata_rows_count (Piece C cleanup):
- The old name was misleading: it counted metadata rows (tract slices), not distinct simulation models. Renamed at 9 emission sites plus ~120 references in tests/test_invariants.py and tests/test_schema_unique_keys.py.
Snapshot regen included: ~1800 cache files churn (SQL-hash drift + the new model_count column). All cache contents validated against existing data via the snapshot harness; pure SQL-shape drift, no real data divergence.
Checkpoint commit before the unified MD-LEFT-TS rewrite.
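A minimal sketch of the `_routing_context` idea: swap the table handles for the duration of one query and restore them afterwards. The class and attribute values below are illustrative stand-ins, not the real QueryCore:

```python
from contextlib import contextmanager

class QueryCoreSketch:
    """Stand-in for the BSQ object; attribute names mirror the commit,
    everything else is hypothetical."""
    def __init__(self):
        self.bs_table = "md_primary"
        self.md_table = "md_primary"
        self.sample_wt = "weight"

    @contextmanager
    def _routing_context(self, table, weight_col):
        saved = (self.bs_table, self.md_table, self.sample_wt)
        self.bs_table = self.md_table = table
        self.sample_wt = weight_col
        try:
            yield self
        finally:  # restore even if the query raises
            self.bs_table, self.md_table, self.sample_wt = saved

bsq = QueryCoreSketch()
with bsq._routing_context("md_agg_national_parquet", "agg_weight"):
    assert bsq.md_table == "md_agg_national_parquet"
assert bsq.md_table == "md_primary"  # handles restored after the query
```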
Replaces opt-in --include-local flag with env-based gating: tests run locally by default and skip when CI=true (set automatically by GitHub Actions and most CI providers). Removes the friction of remembering a flag for the common laptop case.
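The gating pattern reduces to checking the `CI` environment variable; pytest's `skipif` on that variable is the natural mechanism, and this plain decorator just illustrates the gate (function names are hypothetical):

```python
import os

def skip_on_ci(fn):
    """No-op the wrapped test when CI=true (GitHub Actions and most CI
    providers set this automatically); run it everywhere else."""
    def wrapper(*args, **kwargs):
        if os.environ.get("CI", "").lower() == "true":
            return None  # skipped on CI
        return fn(*args, **kwargs)
    return wrapper

@skip_on_ci
def live_query_check():
    # stand-in for an Athena-backed test body
    return "ran"
```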
The bs_table/md_table/md_key locals in the TS aggregation entry point were leftover from before routing moved into `_routing_context` on the BSQ — nothing reads them, the helpers inherit routing automatically. AnyTableType was imported but never referenced.
Summary
Permanently drops support for the legacy 3-table BSB schema (separate baseline / upgrades / timeseries tables) in favor of the OEDI 2-table format (`annual_and_metadata` + `timeseries`). This is the dominant theme of the PR and a hard breaking change — there is no fallback path for the old shape. All query generation, snapshot fixtures, integrity checks, and tests now assume the OEDI shape.
Version bump: 0.3.0 → 0.4.0.
Schema pivot (the breaking change)
- `md_table` (annual + metadata, replaces baseline-and-upgrades) and `ts_table` (timeseries). The `bs_table`/`up_table` handles are now thin SA aliases over `md_table`, kept where two distinguishable handles are needed for self-joins.
- `QueryCore` level: `bs_table`/`up_table` → `md_table`, `bs_key`/`up_key` → `md_key`, `bs_bldgid_column` → `md_bldgid_column`, `_bs_completed_status_col`/`_up_completed_status_col` → `_md_completed_status_col`, `_up_upgrade_col` → `_md_upgrade_col`, `_restrict_targets_bs` → `_restrict_targets_md`, plus matching helpers.
- The `(db, baseline, upgrades)` form is gone — `table_name` is now `str | tuple[str, Optional[str]]`.
- Upgrade reports use a `MAX(CASE WHEN upgrade=…)` + `GROUP BY` shape (replaces the previous baseline/upgrade self-join), with a CTE → subquery fallback for SQLAlchemy 2.0 literal-bind expansion.
- The `upgrade=0` filter is now threaded through baseline-side joins; baseline rows are filtered out of upgrade-side report queries.
- `unique_keys` schema, multi-column restrict keys, and full unique-key tuples returned from building-id helpers.
Performance work enabled by the pivot
- Around the TS `GROUP BY`, defer pure-bs enduses to the outer `SELECT` (constant per building, no need to materialize per-15-min and re-aggregate).
- `join_list` joins move into the `bs_per_bldg` subquery; fall back to a direct `ts ⋈ bs` JOIN when `join_list` is present.
- Push `bs_restrict` into the inner `ts ⋈ bs` JOIN condition so Athena predicate-pushes into the metadata scan (minutes vs. timeouts on ComStock).
Test infrastructure rebuild
- `test_BuildStockQuery.py`/`test_BuildStockSavings.py`/`test_query.py` replaced with the snapshot-driven `test_query_snapshots.py` and a 2102-line `test_invariants.py` that cross-checks independent code paths (annual vs `ts_year_collapse` vs `sum(ts_monthly)`, savings additivity, applied_in semantics, etc.).
- `aggregates_and_savings`, `restrict_avoid`, `savings`, `timeseries`, `utility`, and helper-method snapshots.
- `nbclient` with safety gating.
Packaging + tooling (rides along)
- `setup.py` to PEP 621 `pyproject.toml` with hatchling as the build backend.
- `[dependency-groups]` (install via `uv sync --group dev`); `[project.optional-dependencies.full]` retained for consumers.
- `requires-python = ">=3.12"`.
- `uv.lock` committed for reproducible dev/CI environments; CI switched to `uv` and to a 3.12/3.13/3.14 matrix.
- `--update-snapshot` semantics tightened so snapshots can't paper over real numeric divergence.
Bug fixes uncovered along the way
- `aggregate_ts_by_eiaid` was binding the eiaid column to `md_table` instead of `bs_table` after the collapse.
- `check_ts_bs_integrity`, `get_distinct_count`, and select-from-`md_table` call sites (DUPLICATE_COLUMN_NAME on Athena).
- The `applied_only` filter on TS query paths was dropping rows it shouldn't have (validated by an xfail invariant flipping green).
- `get_upgrades_csv` cross-join, `get_upgrade_names` malformed SQL, `get_successful_simulation_count` crash, the viz_data monthly-results query path, and several composite-key handling regressions.
- `_restrict_targets_md` now recognizes bare unprefixed column names.
- `get_quartiles` correctly disallowed on TS queries.
Test plan
- `uv sync --group dev` resolves cleanly on Python 3.12 / 3.13 / 3.14
- `pytest -vv` passes (snapshot tests + invariants)
- `flake8 buildstock_query` is clean
- `BuildStockQuery` exercised against a real OEDI ResStock + ComStock workspace (annual, savings, TS, multi-state, composite-key)
Migration notes for consumers
- Callers passing `table_name=(db, baseline, upgrades)` must switch to a single OEDI `annual_and_metadata` table name (or `(db, md, ts)`).
- `bsq.bs_table`/`bsq.up_table`/`bsq.bs_key` will continue to work via the alias shims, but the canonical attribute is now `bsq.md_table`/`bsq.md_key`.
- The legacy shape is gone as of `0.4.0` — pin to `0.3.x` if you still need it.