Switch to OEDI Format #74

Closed
rajeee wants to merge 113 commits into main from new_sampling

Conversation


@rajeee rajeee commented Apr 23, 2026

Summary

Permanently drops support for the legacy 3-table BSB schema (separate baseline / upgrades / timeseries tables) in favor of the OEDI 2-table format (annual_and_metadata + timeseries). This is the dominant theme of the PR and a hard breaking change — there is no fallback path for the old shape. All query-generation, snapshot fixtures, integrity checks, and tests now assume the OEDI shape.

Version bump: 0.3.0 → 0.4.0.

Schema pivot (the breaking change)

  • New 2-table model: md_table (annual + metadata, replaces baseline-and-upgrades) and ts_table (timeseries). The bs_table / up_table handles are now thin SA aliases over md_table, kept where two distinguishable handles are needed for self-joins.
  • Renames at the QueryCore level: bs_table/up_table → md_table, bs_key/up_key → md_key, bs_bldgid_column → md_bldgid_column, _bs_completed_status_col/_up_completed_status_col → _md_completed_status_col, _up_upgrade_col → _md_upgrade_col, _restrict_targets_bs → _restrict_targets_md, plus matching helpers.
  • Constructor signature changes: the legacy 3-tuple (db, baseline, upgrades) form is gone — table_name is now str | tuple[str, Optional[str]].
  • TS upgrade-pair queries rewritten to a single-scan MAX(CASE WHEN upgrade=…) + GROUP BY shape (replaces the previous baseline/upgrade self-join), with a CTE → subquery fallback for SQLAlchemy 2.0 literal-bind expansion.
  • upgrade=0 filter now threaded through baseline-side joins; baseline rows filtered out of upgrade-side report queries.
  • Multi-column / composite unique-key support for OEDI tables (e.g. ComStock tract-denormalized metadata): unique_keys schema, multi-column restrict keys, and full unique-key tuples returned from building-id helpers.
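The single-scan upgrade-pair shape can be sketched on a toy table. This is a hedged illustration only: the table and column names (ts, bldg_id, upgrade, kwh) are stand-ins, not the library's actual OEDI schema, and the real query pivots at timestamp grain with more columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, upgrade INTEGER, kwh REAL);
    INSERT INTO ts VALUES (1, 0, 10.0), (1, 1, 7.0),
                          (2, 0, 20.0), (2, 1, 16.0);
""")
# One scan over ts, pivoted with MAX(CASE WHEN upgrade = ...) per side,
# instead of a baseline/upgrade self-join.
rows = con.execute("""
    SELECT bldg_id,
           MAX(CASE WHEN upgrade = 0 THEN kwh END) AS bs_kwh,
           MAX(CASE WHEN upgrade = 1 THEN kwh END) AS up_kwh
    FROM ts
    GROUP BY bldg_id
    ORDER BY bldg_id
""").fetchall()
print(rows)  # [(1, 10.0, 7.0), (2, 20.0, 16.0)]
```

Each building's baseline and upgrade values land on one output row after a single pass over the table, which is what makes the self-join unnecessary.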

Performance work enabled by the pivot

  • TS aggregation unified: pre-bucket the inner GROUP BY, defer pure-bs enduses to the outer SELECT (constant per building, no need to materialize per-15-min and re-aggregate).
  • Pre-aggregate bs to building grain before TS join to eliminate tract fan-out on ComStock.
  • Fold utility join_list joins into the bs_per_bldg subquery (an interim fallback to the direct ts ⋈ bs JOIN shape for join_list queries was later replaced by this fold).
  • Skip inner timestamp materialization for year-collapse TS queries.
  • Push user bs_restrict into the inner ts ⋈ bs JOIN condition so Athena predicate-pushes into the metadata scan (minutes vs. timeouts on ComStock).
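The tract fan-out fix above can be sketched on toy tables. This is an illustrative sketch, not the library's SQL: the names (md, ts, tract, weight) are assumptions, and the real bs_per_bldg subquery also projects characteristic columns via arbitrary().

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE md (bldg_id INTEGER, tract TEXT, weight REAL);
    CREATE TABLE ts (bldg_id INTEGER, kwh REAL);
    -- building 1 is tract-denormalized: two metadata rows
    INSERT INTO md VALUES (1, 'A', 2.0), (1, 'B', 2.0), (2, 'C', 3.0);
    INSERT INTO ts VALUES (1, 10.0), (2, 20.0);
""")
# Direct ts ⋈ md join: building 1 fans out by its 2 tract rows.
naive_rows = con.execute(
    "SELECT COUNT(*) FROM ts JOIN md USING (bldg_id)").fetchone()[0]
naive = con.execute(
    "SELECT SUM(ts.kwh * md.weight) FROM ts JOIN md USING (bldg_id)"
).fetchone()[0]
# Pre-aggregate md to building grain first: weight sums are linear, so the
# total is unchanged while the joined rowset shrinks.
fixed = con.execute("""
    SELECT SUM(ts.kwh * b.bldg_weight)
    FROM ts JOIN (SELECT bldg_id, SUM(weight) AS bldg_weight
                  FROM md GROUP BY bldg_id) AS b USING (bldg_id)
""").fetchall()[0][0]
print(naive_rows, naive, fixed)  # 3 100.0 100.0
```

The totals agree because all bs-side aggregations are linear in weight; the win is purely in the size of the joined rowset (3 rows vs 2 here, billions vs millions on national ComStock queries per the description below).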

Test infrastructure rebuild

  • Replaced legacy test_BuildStockQuery.py / test_BuildStockSavings.py / test_query.py with snapshot-driven test_query_snapshots.py and a 2102-line test_invariants.py that cross-checks independent code paths (annual vs ts_year_collapse vs sum(ts_monthly), savings additivity, applied_in semantics, etc.).
  • Content-addressed disk cache replaces the old pickle cache; SQL is normalized at compile time; cost-history lookup + cost-regression gate guard against runaway Athena spend.
  • Snapshot fixtures regenerated for the OEDI shape across aggregates_and_savings, restrict_avoid, savings, timeseries, utility, and helper-method snapshots.
  • Example notebooks auto-generated from snapshot entries and executed via nbclient with safety gating.

Packaging + tooling (rides along)

  • Migrated from setup.py to PEP 621 pyproject.toml with hatchling as the build backend.
  • Dev deps moved to PEP 735 [dependency-groups] (install via uv sync --group dev); [project.optional-dependencies.full] retained for consumers.
  • Python floor raised to 3.12 (requires-python = ">=3.12").
  • uv.lock committed for reproducible dev/CI environments; CI switched to uv and to a 3.12/3.13/3.14 matrix.
  • Lint: kept flake8 (ruff was tried and reverted for now); --update-snapshot semantics tightened so snapshots can't paper over real numeric divergence.

Bug fixes uncovered along the way

  • aggregate_ts_by_eiaid was binding the eiaid column to md_table instead of bs_table after the collapse.
  • Several comma-join bugs in check_ts_bs_integrity, get_distinct_count, and select-from-md_table call sites (DUPLICATE_COLUMN_NAME on Athena).
  • applied_only filter on TS query paths was dropping rows it shouldn't have (validated by an xfail invariant flipping green).
  • get_upgrades_csv cross-join, get_upgrade_names malformed SQL, get_successful_simulation_count crash, viz_data monthly-results query path, and several composite-key handling regressions.
  • _restrict_targets_md now recognizes bare unprefixed column names.
  • TS pivot crash on calculated columns; get_quartiles correctly disallowed on TS queries.

Test plan

  • uv sync --group dev resolves cleanly on Python 3.12 / 3.13 / 3.14
  • pytest -vv passes (snapshot tests + invariants)
  • flake8 buildstock_query is clean
  • Smoke-test BuildStockQuery against a real OEDI ResStock + ComStock workspace (annual, savings, TS, multi-state, composite-key)
  • Run example notebooks end-to-end against the cached snapshot fixtures

Migration notes for consumers

  • Anyone passing table_name=(db, baseline, upgrades) must switch to a single OEDI annual_and_metadata table name (or (db, md, ts)).
  • Anyone reading bsq.bs_table / bsq.up_table / bsq.bs_key will continue to work via the alias shims, but the canonical attribute is now bsq.md_table / bsq.md_key.
  • The 3-table BSB workspace shape is not supported on 0.4.0 — pin to 0.3.x if you still need it.

@rajeee rajeee changed the title from uv-native packaging + applied_in features (v0.4.0) to uv-native packaging migration (v0.4.0) on Apr 23, 2026
rajeee and others added 8 commits April 22, 2026 21:11
New [unique_keys] TOML block with metadata and timeseries key lists
(timeseries must be a subset of metadata). Three new schemas ship with
it: comstock_oedi_state_and_county, resstock_oedi_new, resstock_oedi_vu.

bs_key, up_key, ts_key tuples (plus *_key_cols properties) expose the
configured unique-key identity of each table. Aggregate sample counts,
load-duration grouping, and timesteps verification now use these tuples
via _count_distinct() and multi-column group-by. Single-key schemas
still emit byte-identical SQL.

Also carries prior session fixes: CACHED sentinel skipped in batch
helpers, cost log precision bumped to 2 decimals, and _get_rows_per_building
groups on the configured timeseries keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
_get_applied_in_subquery now projects all metadata key columns, and
_add_applied_in_restrict emits a tuple-IN filter against the correct
side (ts_key when timeseries-backed, bs_key otherwise). Single-key
schemas still emit the original single-column IN form.

_get_restrict_clauses and _normalize_restrict_subquery extended to
accept a tuple of columns paired with a multi-width subquery, using
sa.tuple_(c1,...).in_(subquery) under the hood.

Also fix an outdented return in _get_buildings_by_change that broke
the module at import time under strict linting.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
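The tuple-IN shape that sa.tuple_(c1, ...).in_(...) compiles to can be sketched directly in SQL. A hedged toy example (table and key names are illustrative, and a literal VALUES list stands in for the multi-width subquery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE md (bldg_id INTEGER, state TEXT, vintage TEXT);
    INSERT INTO md VALUES (1, 'CO', '1980s'), (2, 'CO', '1990s'),
                          (3, 'TX', '1980s');
""")
# Multi-column restrict: a (bldg_id, state) tuple matched against a
# two-column row set.
rows = con.execute("""
    SELECT bldg_id FROM md
    WHERE (bldg_id, state) IN (VALUES (1, 'CO'), (3, 'TX'))
    ORDER BY bldg_id
""").fetchall()
print(rows)  # [(1,), (3,)]
```

A single-key schema degenerates to the familiar scalar `bldg_id IN (...)` form, which is why single-key SQL stays byte-identical.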
get_building_ids and get_buildings_by_locations now return DataFrames
projecting all bs_key columns instead of just bldg_id. _get_upgrade_buildings
and get_buildings_by_change return list[int] for single-key schemas and
list[tuple] for multi-key. _get_full_options_report's array_agg uses
row(...) packing when the upgrade key has more than one column.

Single-bldg_id default paths keep byte-identical SQL and return shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
get_results_csv, get_results_csv_full, get_upgrades_csv, and
get_upgrades_csv_full now return flat DataFrames. Parquet readbacks
that still carry a bldg_id index are reset so the returned shape is
consistent regardless of source.

Breaking change: callers that relied on df.loc[bldg_id] must now do
df[df.bldg_id == bldg_id] or set the index themselves.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Old check compared row counts between timeseries and baseline/upgrade
tables, which produced false positives whenever a bldg_id could
legitimately appear in baseline more than once (e.g. per state).

New flow:
  1. Distinct-bldg parity per upgrade between timeseries and baseline/upgrade
     tables, using bs_key and ts_key for distinct counting. A dedicated
     _get_metadata_distinct_bldg_count replaces the row-counting derivation
     from get_success_report.
  2. Duplicate detection per table: group by bs_key on baseline, up_key +
     upgrade on the upgrade table, and ts_key + timestamp (+ upgrade when
     present) on the timeseries table. First 5 offending rows are printed.
  3. Existing per-building row-count verification retained, now phrased in
     terms of partitions rather than buildings.

_get_ts_report now counts distinct ts_key tuples instead of distinct bldg_id.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
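The duplicate-detection step above can be sketched as a GROUP BY over the full unique key with a HAVING filter. A hedged toy version (names are illustrative; the real check groups on the configured ts_key plus timestamp and, when present, upgrade):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, state TEXT, stamp TEXT, kwh REAL);
    INSERT INTO ts VALUES (1, 'CO', '00:15', 1.0), (1, 'CO', '00:15', 1.0),
                          (2, 'CO', '00:15', 2.0);
""")
# Group by the full key; any group with more than one row is a duplicate.
# LIMIT 5 mirrors "first 5 offending rows are printed".
dupes = con.execute("""
    SELECT bldg_id, state, stamp, COUNT(*) AS n
    FROM ts GROUP BY bldg_id, state, stamp
    HAVING COUNT(*) > 1
    LIMIT 5
""").fetchall()
print(dupes)  # [(1, 'CO', '00:15', 2)]
```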
Near-duplicate of _get_rows_per_building with no callers in the
package. Drop it rather than maintain two versions through the
multi-key refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- bs_key / up_key / ts_key reflect configured unique_keys and revert
  to (bldg_id,) when cleared.
- applied_in emits tuple-IN for multi-key schemas and scalar-IN for
  single-key.
- aggregate_timeseries with timestamp_grouping_func emits
  count(DISTINCT (c1, c2)) for multi-key and count(distinct(c1)) for
  single-key.
- get_building_ids projects all bs_key columns.

Also switch _count_distinct to sa.distinct(sa.tuple_(*cols)) so the
multi-column form renders as the standard SQL `count(DISTINCT (c1, c2))`
rather than `count(distinct(c1, c2))`, which Trino parses as a
single-arg function call on a row tuple.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
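Trino accepts the standard `count(DISTINCT (c1, c2))` form directly. SQLite has no multi-column distinct aggregate, so this hedged sketch expresses the same distinct-tuple count with a DISTINCT subquery (toy names, not the library's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, state TEXT);
    INSERT INTO ts VALUES (1, 'CO'), (1, 'CO'), (1, 'TX'), (2, 'CO');
""")
# Portable equivalent of count(DISTINCT (bldg_id, state)): count the rows
# of a DISTINCT two-column subquery.
n = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT bldg_id, state FROM ts)"
).fetchone()[0]
print(n)  # 3
```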
rajeee added 18 commits April 23, 2026 00:18
The branch earlier swapped flake8 for ruff with an aggressive ruleset
(E, W, F, I, UP, B) that flagged ~280 pre-existing UP/B violations
unrelated to unique_keys work. Revert to flake8 with the same config
used on main (.flake8, max-line-length = 120) so this PR stays focused
on multi-key unique-keys changes.

Also wrap two 137-char lines in _get_metadata_distinct_bldg_count so
flake8 passes cleanly.
UNLOAD writes query results as multiple parquet files, and row order
across chunks is not guaranteed when read back. test_savings_shape_with_timestamp_grouping
was intermittently failing because the time column started at different
points depending on which parquet chunk pandas read first.

Sort df1 and df2 by (geometry_building_type_recs, time) before comparing
so the assertion is order-independent.
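The fix pattern can be sketched without parquet at all: sort both sides on the comparison keys before asserting equality. The row values here are made up for illustration; the real test sorts pandas frames on (geometry_building_type_recs, time).

```python
# Chunked parquet readback can interleave rows arbitrarily; sorting both
# sides on the grouping keys makes the equality check order-independent.
chunk_order_a = [("SFD", "2018-02", 5.0), ("SFA", "2018-01", 3.0)]
chunk_order_b = [("SFA", "2018-01", 3.0), ("SFD", "2018-02", 5.0)]


def key(row):
    return (row[0], row[1])  # (building type, time)


same = sorted(chunk_order_a, key=key) == sorted(chunk_order_b, key=key)
print(same)  # True
```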
Set the pandas index back on the four CSV-returning functions to
preserve backward compatibility with callers that rely on
df.loc[bldg_id]. For single-key schemas the index is a single-level
Index on bldg_id (byte-identical to main). For multi-key schemas it
becomes a pandas MultiIndex on the configured bs_key / up_key tuple.

get_results_csv, get_results_csv_full, get_upgrades_csv, and
get_upgrades_csv_full all use `set_index(list(self.bs_key))` (or up_key).
Full variants reset any existing index from the parquet before re-
indexing so the result is consistent regardless of how the parquet
was written.
Fixes ValueError when querying timeseries data against schemas whose
timeseries unique key spans multiple columns (e.g. bldg_id + state).
_add_applied_in_restrict now threads a width-matched subquery to the
correct side, _split_restrict routes tuple column refs to the ts side,
and restrict/avoid accept (col_tuple, subquery | list-of-tuples).
bs_tbl in this branch is a ts-side subquery whose columns match the
timeseries unique key, not the metadata key. Looking up bs_key on it
raises KeyError for schemas whose metadata key has cols absent from ts
(e.g. in.nhgis_tract_gisjoin). Surfaced by a monthly TS query against
the comstock_oedi_state_and_county schema.
Clears the deprecated query surface ahead of the 2.0 release so the new
bsq.query(...) facade is the single public path for annual, timeseries,
and savings aggregations. Internal callers (utility_query, viz_data) are
migrated, savings_query is deleted outright, and the savings_shape-compat
branches in aggregate_query._query collapse to single canonical paths.

Also drops TSQuery.split_enduses (application code should batch) and
TSQuery.collapse_ts (use timestamp_grouping_func="year"); SavingsQuery
schema is removed.
The else-branch in init_monthly_results referenced run_obj._conn, which
does not exist on BuildStockQuery, and used run_obj.params instead of
run_obj.run_params for the S3 unload bucket. Both were latent bugs that
only fired on cache miss; the prior aggregate_timeseries SQL happened to
match cached entries, so the branch never ran in CI.

Replace the hand-rolled unload/cache dance with a direct run_obj.query()
call, which routes through execute() and handles caching, S3 result
lookup, and UNLOAD submission consistently with the rest of the library.

Note: test_Viz fixtures may need their reference pkl cache regenerated
against the new SQL text; that is a separate fixture-maintenance task.
When a timeseries query asks only for upgrade output (no savings, no
baseline) with applied_only=True, the ts_b/ts_u subquery pairing adds
unnecessary cost and, more importantly, returns zero rows for runs whose
timeseries table lacks upgrade=0 rows (upgrade-only runs). The column
expressions for applied_only reference only up_tbl, so we can aggregate
ts_u directly with the same shape the upgrade_id="0" branch already uses.

Also refresh the viz test reference caches with entries keyed off the new
SQL text.
Relocate the pre-snapshot test modules (test_BuildStockQuery.py,
test_BuildStockUtility.py, test_query.py, test_Viz.py, and their helpers)
into tests/legacy/ so pytest stops collecting them. The new snapshot-driven
framework replaces them and the old files are kept for reference.
Introduce a test framework that pins BuildStockQuery.query() generated SQL
and reference result data per curated argument set, plus cross-query
invariant checks that verify internal consistency without hitting Athena.

Framework (tests/test_utility.py, tests/conftest.py,
tests/test_query_snapshots.py):
  - Each snapshot entry in tests/query_snapshots/{schema}/*.json carries
    one or more equivalent arg-variants, a .sql file (exact generated
    SQL), and a .parquet file (reference DataFrame).
  - Normalized-whitespace SQL comparison with sqlglot fallback; variant
    divergence surfaces as per-variant failure blocks.
  - Pytest options --check-data, --update-sql, --update-sql-and-data
    control data comparison and (re)generation of reference files.
  - Session-scoped fixtures for comstock_oedi_state_and_county and
    resstock_oedi_vu schemas against buildstock_sdr on rescore.

Invariants (tests/test_invariants.py):
  - annual == ts-year-collapse == sum(ts-monthly) for enduse totals.
  - sample_count / units_count consistency across query flavors (resstock
    only; comstock counts intentionally skipped pending a library fix).
  - baseline - upgrade ~= savings within a single include_savings query.

Seeded with ~30 entries across annual/timeseries/savings/applied_only/
restrict_avoid/mapped_column for both schemas, CO-restricted and limited
to electricity/natural-gas totals to keep Athena usage small.

See tests/query_snapshots/README.md for the workflow.
_add_restrict already accepts annual_only to route restrict columns
through the baseline vs timeseries table, but the caller in
BuildStockAggregate._query was never passing it, while the adjacent
_add_avoid call was. Bring restrict into line with avoid so annual-only
queries consistently resolve restrict columns against the baseline
table.
On schemas where one TS row maps to multiple metadata rows (e.g.
comstock_oedi_state_and_county, which has one metadata row per
(bldg_id, nhgis_tract_gisjoin, state) but one TS row per (bldg_id,
state)), the TS-grouping branch of BuildStockAggregate._query was
returning sample_count values that undercounted by the per-TS-key
metadata multiplicity — empirically 1.4x-100x across building types.

Count distinct over the full metadata key on the baseline table
(already present in tbljoin) instead of over the TS key on the TS
subquery. For schemas where bs_key == ts_key (resstock_oedi_vu etc.)
the resulting count value is identical; SQL differs because columns
are now referenced via bs_table.

Comstock TS-grouping snapshot data needs regeneration after this
change:

    pytest -s -v tests/test_query_snapshots.py --update-sql-and-data

Re-enable the sample_count/units_count check in the comstock invariant
test once snapshots are refreshed.
The library bug that made these undercount on comstock TS-grouping queries
was fixed in 25fa3fa. Restore the active count-column comparisons to guard
against regression. Now 4/4 invariants pass including the comstock counts.
Only _get_gcol (the group_by path) used to fall back to the prefixed
column name when a bare name didn't resolve. Restrict, avoid, and the
TS side of _split_restrict went straight to _get_column and raised
ValueError on bare names like "state" / "vintage" whenever the
underlying table only exposed the prefixed form ("in.state" etc).

Move the fallback into QueryCore._get_column itself so every caller
that resolves a user-provided string column gets the same behavior.
Simplify _get_gcol to drop its now-redundant except block.

Also trim trailing-newline inconsistency in
comstock_oedi_state_and_county.toml.
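The fallback described above can be sketched in a few lines. This is a hypothetical stand-in for QueryCore._get_column (the helper name and prefix tuple are illustrative; the real code resolves SQLAlchemy column objects using the schema's configured _char_prefix/_out_prefix):

```python
def resolve_column(name: str, table_columns: set[str],
                   prefixes: tuple[str, ...] = ("in.", "out.")) -> str:
    """Resolve a bare user-provided name, retrying with each configured
    prefix before giving up."""
    if name in table_columns:
        return name
    for prefix in prefixes:
        if prefix + name in table_columns:
            return prefix + name
    raise ValueError(f"Column {name!r} not found")


cols = {"in.state", "in.vintage", "out.site_energy.total.energy_consumption"}
print(resolve_column("state", cols))  # in.state
```

Putting the fallback in one place means restrict, avoid, group_by, and the TS side of _split_restrict all accept bare names like "state" identically.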
Now that _get_column auto-prepends the characteristics prefix, the
snapshot JSONs no longer need per-schema hand-prefixing for restrict
and avoid. Simplify to a uniform convention:

  - group_by: write "in.<col>" explicitly (readable, consistent across
    schemas).
  - restrict / avoid: write bare "state" / "vintage" — the library's
    auto-prefix fallback resolves to "in.state" / "in.vintage" when
    that's what exists on the table.

Comstock group_by entries gain their "in." prefix; resstock
restrict/avoid entries drop theirs. Stored SQL is unchanged because
the resolved SQLAlchemy column objects are the same.
Now that QueryCore._get_column auto-prepends the characteristics prefix
for both group_by and restrict/avoid, snapshot JSON entries can rely on
bare column names everywhere. Stored .sql files are unchanged — the
fallback resolves bare "state" / "geometry_building_type_recs" /
"comstock_building_type" / "vintage" to the same SQLAlchemy column
objects that were previously written out explicitly.
Collapse four near-duplicate test functions into two parametrized ones:

  test_annual_equals_ts_year_equals_ts_monthly_sum[resstock_...]
  test_annual_equals_ts_year_equals_ts_monthly_sum[comstock_...]
  test_savings_decomposition[resstock_...]
  test_savings_decomposition[comstock_...]

Each case provides the fixture name, schema directory, per-leg enduse
lists (comstock baseline uses "..kwh" suffix, TS does not), and the
group-by column. The bsq fixture is resolved at test time via
request.getfixturevalue so the case table stays declarative.

Net: -48 lines, same 4 cases, no behavior change (15/15 pass).
rajeee added 20 commits April 27, 2026 11:07
These queries fire on every BuildStockQuery(skip_reports=False) init
and have already produced two comma-join regressions. Pinning their
SQL shape in a snapshot catches future re-introductions automatically.

Refactor: split SA query construction out of _get_metadata_distinct_bldg_count
and _find_duplicate_rows into _build_*_queries* helpers so the snapshot
harness can compile without executing. Add get_query_only kwarg to
check_ts_bs_integrity returning the full list of compiled SQL strings
in execution order.

Mark the entry nondeterministic so the data check never runs —
_get_rows_per_building scans the entire TS table and would be a multi-TB
landmine if the harness tried to validate data on a hash refresh.
The TS upgrade-pair pivot path looked up `ts.c[e.name]` for each enduse
when building the per-side CASE expressions. That works for bare ts
columns but raises KeyError on Labels produced by `get_calculated_column`,
since `e.name` is the user-provided label (e.g. "elec_minus_gas") rather
than a real ts column.

Wrap `e.element` (the underlying arithmetic expression) in CASE for
Labels and the column directly otherwise. The expression already
references raw ts columns, so the CASE correctly gates the whole
arithmetic per upgrade side and the result keeps its `bs__<name>` /
`up__<name>` label for the outer _SideView lookup.

The original `calculated_column_elec_minus_gas` snapshot only exercised
table='baseline' + annual_only=True + upgrade_id=0 — none of the three
dimensions needed to enter the broken branch. Add two TS pivot variants
(applied_only=False to force upgrade-pair pivot, and applied_only=True +
include_baseline=True to additionally exercise applied_in synthesis) so
this regression class is caught next time. Both restrict to state=CO to
keep the TS scan bounded.

Also: route `placeholder_annual` through the calc-column test's
expression substitution so TS-table calculated columns resolve to the
suffix-less ts column names on comstock.
Old shape (single-scan + SUM-CASE pivot, GROUP BY raw timestamp):
- Inner GROUP BY at (bldg, 15-min, state) granularity → billions of groups
  on national-scope hourly queries → Athena timeouts at 30 min.
- SUM(CASE WHEN ...) puts arithmetic INSIDE the aggregate phase, defeating
  scan-time projection pushdown.

New shape (two-level subquery + FILTER aggregates, GROUP BY bucketed time):
- Innermost ts_flat subquery materializes per-row enduse arithmetic as a
  scalar before any aggregation — Trino can push the projection down to
  the scan layer.
- Outer ts_pivot subquery uses FILTER (WHERE upgrade=N) per side and
  GROUPs BY at the user's time granularity (date_trunc pre-applied):
  4× fewer keys for hourly, 720× fewer for monthly, 35,000× fewer for yearly.
- Per-bldg granularity preserved for correct weight propagation through
  the outer JOIN to bs.
- ComStock composite primary key (bldg_id, state) handled correctly via
  ts_unique_keys flowing through every layer.

Carries forward _inner_rows = count(*) FILTER (WHERE upgrade=0) so the
outer rows_per_sample metric (which measures underlying 15-min cadence)
stays correct after pre-bucketing. units_count is self-correcting.

Also fixes a pre-existing calc-col bug in the annual flow: get_col(up_tbl,
col) for a Label fell through to returning the bs-bound expression.
Annual upgrade!=0 + applied_only=False with a calc col enduse silently
returned BASELINE values. Fix: rebind_to(col, target_tbl) using SA's
ClauseAdapter(target, adapt_on_names=True) to traverse the Label's
expression and rewrite column refs.

Test changes:
- New test_calculated_column_matches_manual_decomposition_per_flow
  invariant (replaces the cross-flow check that had false-positive drift
  amplification on comstock).
- Updated SQL substring assertions in test_schema_unique_keys to match
  the new ts_flat / ts_pivot subquery shape.

Verified: 68/68 invariants pass (including the 6 bare-fuel cross-flow
variants that were failing before).
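The two-level shape above (bucket time in an inner subquery, then pivot with FILTER aggregates) can be sketched on a toy table. Hedged heavily: names are illustrative, strftime stands in for Trino's date_trunc, and the real ts_flat layer also materializes per-row enduse arithmetic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, upgrade INTEGER, t TEXT, kwh REAL);
    INSERT INTO ts VALUES
      (1, 0, '2018-01-01 00:15', 1.0), (1, 0, '2018-01-01 00:30', 1.0),
      (1, 1, '2018-01-01 00:15', 0.5), (1, 1, '2018-01-01 00:30', 0.5);
""")
# Inner subquery buckets 15-min timestamps to the hour first; the outer
# level pivots the two upgrade sides with FILTER aggregates, so the
# GROUP BY runs over far fewer keys than raw-timestamp grouping.
rows = con.execute("""
    SELECT bldg_id, hour,
           SUM(kwh) FILTER (WHERE upgrade = 0) AS bs_kwh,
           SUM(kwh) FILTER (WHERE upgrade = 1) AS up_kwh
    FROM (SELECT bldg_id, upgrade,
                 strftime('%Y-%m-%d %H:00', t) AS hour, kwh
          FROM ts)
    GROUP BY bldg_id, hour
""").fetchall()
print(rows)  # [(1, '2018-01-01 00:00', 2.0, 1.0)]
```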
Six TS-pivot-driven entries auto-updated by --update-snapshot after the
data check confirmed numerical equivalence:

Cost wins (query time, data scanned unchanged):
  savings_ts_monthly_upgrade1_electricity (resstock):  -33%
  savings_ts_monthly_two_fuel              (resstock): -48%
  savings_ts_monthly_upgrade1_electricity (comstock): -50%
  savings_ts_monthly_two_fuel              (comstock): -47%
  calculated_column_ts_pivot_with_baseline (comstock): -55%

  savings_ts_monthly_upgrade1_quartiles    (comstock): -57%
  ↑ NOT yet written (data-mismatch — see below).

Two entries still need user attention:

1. savings_ts_monthly_upgrade1_quartiles (resstock): cost regression +32%
   on query time alone (data scanned unchanged). Auto-blocked per
   CLAUDE.md cost-regression gate.

2. savings_ts_monthly_upgrade1_quartiles (comstock): data-status=mismatch.
   The new pivot pre-buckets time at the inner GROUP BY, so
   approx_percentile(coalesce(bs - up)) is now over hourly savings
   values, not raw 15-min values. This is a real semantic change of the
   savings-quartile distribution (smoother — averaging within hours
   reduces outliers). Both quartile-savings entries need a deliberate
   --overwrite-snapshot from the user.
The TS pivot's inner GROUP BY pre-buckets time at the user's
`timestamp_grouping_func` granularity. Quartile aggregates take the
percentile over the resulting per-(bldg, bucket) summed values — which
is the consistent interpretation for a monthly query: "the 50th
percentile of MONTHLY building totals", not "the 50th percentile of
15-min instantaneous values" (the latter doesn't compose with the rest
of a monthly rollup).

The pre-pivot code emitted the 15-min-level quartile because that's how
the SQL fell out of its row-per-15-min pivot subquery — not by design.
The new behavior is more meaningful and matches user intent.

Snapshots for `*_quartiles` TS entries need `--overwrite-snapshot` to
adopt the new (correct) values.
Quartiles over timeseries are confusing: percentile-of-15-min-values
doesn't compose meaningfully with a monthly rollup, and percentile-of-
per-bucket-sums is non-obvious. Sum/avg/min/max are sufficient
diagnostics on the TS path; full distributional analysis belongs on
annual queries.

Pydantic validator on Query rejects get_quartiles=True with
annual_only=False at construction time. Snapshot entry
savings_ts_monthly_upgrade1_quartiles removed from savings.json,
its cache files deleted, and the QUARTILE_ENTRIES invariant list
updated to drop it. Other quartile coverage (annual_baseline_quartiles
and savings_annual_upgrade1_quartiles) remains.
Three structural improvements to the TS aggregation pipeline, building on
the upgrade-pair pivot work:

1. **Single shape for baseline + upgrade-pair**: ts_flat → ts_aggr → outer
   JOIN. Single-upgrade uses plain SUM(_v__col); upgrade-pair uses FILTER
   per side. Inner pre-bucketing applies uniformly: 4×/720×/35,000× shuffle
   key cardinality reduction for hourly/monthly/yearly. Raw 15-min single-
   upgrade output skips the aggregate layer (one row in, one row out).

2. **Partition-aligned GROUP BY ordering**: inner ts_aggr GROUPs BY
   (state, timestamp, bldg_id) — partition col first. Outer SELECT
   defaults to inserting time AFTER the user's group_by columns, so
   GROUP BY (state, timestamp) at outer too. Matches the parquet's
   `by_state` partitioning instead of fighting it.

3. **Pure-bs enduses skip ts_flat entirely**: characteristic columns
   (e.g. `bs.c["in.sqft..ft2"]`) and pure-bs calc cols are projected at
   the outer SELECT directly via `sum(bs.<col> * bs.weight)`. Eliminates:
   - the duplicate bs JOIN inside ts_flat that Gemini-style reviewers
     flagged
   - the rows-per-bucket multiplier on characteristic enduses (sqft was
     getting `4 × sum(sqft × weight)` for hourly; now correctly returns
     `sum(sqft × weight)` once per bucket)
   Mixed bs+ts calc cols still take the inner-join path (preserves
   existing behavior for that rare case).

Snapshot SQL hashes drift across all TS entries — data unchanged but
hashes need user-driven `--update-snapshot`. All 68 invariants pass;
21 schema-unique-key tests updated to match new shape.
ComStock's md_by_state_and_county_parquet is tract-denormalized: multiple
rows per (bldg_id, state). The previous TS-aggregation pipeline did a
direct ts_aggr ⋈ bs JOIN at full md grain, which fanned out each pivot
row by N_tracts-per-bldg. The post-join COUNT(DISTINCT (bldg, tract,
state)) shuffle ran over the blown-up rowset — Stage 5 of a national
hourly query processed 6.28 B rows / 499 GB and aborted at 17h33m on the
user's failing query.

ResStock's md is one row per building, so this never came up there.

Fix (suggested by Claude-on-web): pre-aggregate bs to per-(bldg, state)
grain in a `bs_per_bldg` subquery before the outer JOIN. All bs-side
aggregations are linear in `weight`, so the math collapses cleanly:
  - bldg_weight = SUM(weight) per bldg
  - tract_count = COUNT(*) per bldg (used for sample_count via SUM at outer)
  - bs-side characteristic columns (sqft, building_type, etc.) — projected
    via arbitrary() since they're per-bldg constants.

Outer SELECT becomes uniform: every aggregate is sum(<value> * bldg_weight)
(for ts-side enduses) or sum(bs_per_bldg.<bldg_constant>) (for sample_count
and bldg_weight). No special pure-bs path; no double-multiplication; no
tract-grain JOIN.

Also folds the single-upgrade branch into the unified shape: previously
it took a direct ts ⋈ bs join (no ts_aggr layer), which couldn't benefit
from bs_per_bldg's tract-fan-out elimination. Now ts_aggr applies
uniformly. The per-bldg shuffle stage (3 B keys for national hourly) is
narrow scalars and finishes quickly; the original Stage 5 blowup was
specifically the post-join fan-out, not ts_aggr.

Test changes:
- 21/21 schema-unique-keys tests updated to match the new SQL substring
  shape (count_distinct → sum(tract_count); bs ⋈ → bs_per_bldg ⋈).
- 66/66 invariants pass (data correctness verified end-to-end against
  Athena: cross-flow consistency, savings decomposition, multi-state
  additivity, quartile monotonicity, sample/units-count matching).
- 9 snapshot SQL-hash mismatches expected from the new shape — user-
  driven --update-snapshot regenerates after.

The user verified the original failing query (national ComStock hourly
baseline with sqft + calc-col enduses, 17h33m timeout before) now
finishes successfully.
When a user-supplied restrict references a bs-side column by its bare
name (e.g. `"geometry_building_type_recs"` instead of the full
`"in.geometry_building_type_recs"`), `_restrict_targets_md` previously
returned False because it only checked the bare form against
`bs_table.columns`. This caused `_split_restrict` to drop the clause into
`extra_restrict` (clauses that target neither md nor ts), which was then
applied at the outer WHERE.

After the bs_per_bldg refactor, the outer FROM no longer references the
canonical bs alias directly. An outer-WHERE clause like `bs.<col> = 'x'`
created an implicit comma-join, fanning out the bldg×tract product again.
On a national hourly TS query this produced a query that ran past 12+
minutes before being cancelled.

Fix: in `_restrict_targets_md`, also check the column-name with each
configured prefix (`_char_prefix`, `_out_prefix`) appended. Now
`"geometry_building_type_recs"` correctly classifies as md-side and gets
routed through `bs_per_bldg`'s inner WHERE.

This was a latent issue that became visible (and harmful) only after the
bs_per_bldg refactor stopped exposing the bs alias at the outer level.
The bs_per_bldg refactor replaced the canonical bs alias in the outer
FROM. join_list joins reference bs columns by the canonical alias, so
they were unbound and SA generated implicit comma-joins.

Fix routes _add_join's bs-side reference through the bs_alias param
(defaults to bs_table; TS queries pass bs_per_bldg). Also propagates
join_list's bs-side baseline columns into bs_per_bldg via arbitrary()
so they're available for the outer JOIN ON clause.

baseline_column_name in join_list can be either a string or an SA
Column (utility code passes Column objects directly). Both paths
handled.
Utility queries (aggregate_ts_by_eiaid, etc.) pass a join_list referencing
non-bs tables (e.g. eiaid_weights). bs_per_bldg can't cleanly absorb
weights/group-bys bound to those tables without comma-joining them in.

Quick fix: when params.join_list is non-empty, skip the bs_per_bldg
pre-aggregation and use the pre-refactor direct ts ⋈ bs JOIN shape.
Utility queries are typically eiaid-restricted, so tract fan-out is
small enough not to blow up.

A follow-up will fold the join_list table into bs_per_bldg (per the
user's suggestion: "any extra join we do - such as for eiaid_weight is
only for the metadata column and we should be able to fold it into
metadata per state/bldg subquery"). That keeps the tract-fan-out
elimination win for utility queries too. For now this fallback at
least restores correctness.

Outer SELECT's grouping_metrics / total_weight / agg_weight all gate
on `"bldg_weight" in md_alias.c` to detect which shape ran.

Per user request: utility queries' join_list joins (e.g. eiaid_weights
mapping bldg→eiaid) are metadata-side extensions of bs. Folding them
INTO bs_per_bldg's FROM keeps the outer query a clean
ts_aggr ⋈ bs_per_bldg shape with no extra outer JOINs and no comma-join
risks.

bs_per_bldg's GROUP BY stays at (bldg, state) regardless of whether the
user grouped by a join_list-table column. join_list group-by columns
(e.g. eiaid) get projected via arbitrary() — picks one value per bldg.
That's only correct when the column is per-bldg constant after the
restrict (typical utility flow batches by single eiaid). User signed
off on this semantic: "the building doesn't need to appear twice. Only
when user explicitly passed eiaid in the group by do we need to
preserve the eiaid group" — preserved at outer GROUP BY level.

Restrict clauses targeting join_list tables (e.g.
`eiaid_weights.eiaid IN ['4110']`) are routed from extra_restrict to
the bs_per_bldg WHERE so the join filter pushes through.

The outer _add_join is a no-op for these queries (the join already
happened inside bs_per_bldg). For annual or non-bs_per_bldg paths,
_add_join still adds the join at outer.

The cost gate fired on any percentage delta over ±20%, even when the
absolute change was sub-MB or sub-second. After the bs_per_bldg refactor
this caused 19 entries to be blocked from update-snapshot writes despite
having matching data — most were noise-floor regressions like
4241→4242 MB or 5→6 s.

Add per-metric absolute floors (10 MB, 10 s) that combine with the
existing ±20% percentage check: a metric trips the gate only if BOTH
thresholds are exceeded. The two existing all-noise floors
(_COST_FLOOR_MB / _COST_FLOOR_MS) stay in place as the early-return
gate.
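
The combined check reduces to a few lines. A minimal sketch, with floor
values from this commit message; the function and constant names are
illustrative, not the real gate's API.

```python
ABS_FLOOR_MB = 10   # absolute floor for bytes-scanned deltas
ABS_FLOOR_S = 10    # absolute floor for wall-clock deltas
PCT_FLOOR = 0.20    # the existing ±20% relative check

def trips_gate(old: float, new: float, abs_floor: float) -> bool:
    """A metric trips the gate only if BOTH the relative and the
    absolute change exceed their thresholds."""
    delta = abs(new - old)
    pct = delta / old if old else float("inf")
    return pct > PCT_FLOOR and delta > abs_floor

# 4241 -> 4242 MB: tiny relative and absolute change -- noise, passes.
print(trips_gate(4241, 4242, ABS_FLOOR_MB))   # False
# 5 -> 6 s: +1 s is under the 10 s floor -- noise, passes.
print(trips_gate(5, 6, ABS_FLOOR_S))          # False
# A hypothetical +3 GB scan regression: trips both thresholds.
print(trips_gate(4242, 7300, ABS_FLOOR_MB))   # True
```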

The refresh writes through 13 entries that were previously blocked.
The 6 remaining blocks are real shape regressions (5 ts_year/inv_year
entries with +3 GB scan from the new ts_aggr layer forcing full TS
partition scans on year-collapse; 1 calculated_column entry with +14 s
wall-clock) for separate investigation.

When timestamp_grouping_func='year' the outer query collapses time via
SUM regardless — carrying timestamp through the inner ts_flat / ts_aggr
forced Athena to materialize the truncated value before the inner GROUP
BY, blocking partial-aggregation pushdown into the scan.

For year-collapse, drop timestamp from ts_key_names entirely. The inner
SELECT no longer projects it, the inner GROUP BY no longer keys on it,
and Athena fuses the per-bldg sum directly with the scan.
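
The key-list change can be sketched as below. This is an illustration only:
the timestamp key name ("timestamp") and the helper shape are assumptions,
not the real ts_key_names plumbing.

```python
def inner_ts_keys(ts_key_names: list[str], grouping_func: str) -> list[str]:
    """Keys for the inner SELECT / GROUP BY of the ts pre-aggregation."""
    if grouping_func == "year":
        # The outer SUM collapses time anyway; keeping the timestamp here
        # forced Athena to materialize the truncated value before the inner
        # GROUP BY, blocking partial-aggregation pushdown into the scan.
        return [k for k in ts_key_names if k != "timestamp"]
    return ts_key_names

print(inner_ts_keys(["bldg_id", "timestamp"], "year"))   # ['bldg_id']
print(inner_ts_keys(["bldg_id", "timestamp"], "month"))  # ['bldg_id', 'timestamp']
```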

Recovers pre-pivot performance for the year-collapse shape:
- inv_baseline_ts_year_by_building_type: 4242 → 1128 MB (-73%),
  15694 → 5376 ms (-66%, 10s faster).
- 5 ts_year / inv_year cost regressions clear, plus 5 dependent
  snapshot entries write through cleanly.

The bs_per_bldg layer stays in place to handle ComStock's tract
fan-out — the regression was specific to the inner timestamp
materialization, not the per-bldg pre-aggregation.

Investigation: ran calculated_column_ts_pivot_with_baseline against
Athena 3 times each on the OLD and NEW SQL with cache-busting nonces.

  OLD (3 fresh runs): 9881 / 6310 / 14979 ms
  NEW (3 fresh runs): 10408 / 7422 / 19444 ms

~10s mean per shape, with a natural ~6-19s spread per single execution.
The +199% / +14.3s "regression" was an unlucky pairing of two outliers
(stored 7187 ms baseline vs fresh 21454 ms) — the SQL shapes are
statistically equivalent. Bytes-scanned identical (~8.2 GB both shapes,
deterministic).

No code change; just refresh the cost sidecar via --overwrite-snapshot
so the gate stops reporting noise as a regression.

The 10s absolute floor on the timing gate is smaller than the natural
single-execution spread for queries in this size class. Future work
could either drop the timing check entirely (rely on bytes scanned,
which is what Athena bills) or scale the floor with query size; for
now, accept that single-run timing is informational only.

Each (flavor.json, schema) pair becomes
tests/example_notebooks/<flavor>_<schema>.ipynb containing:
  1. Imports
  2. BuildStockQuery(...) construction (mirrors conftest fixture)
  3. One markdown+code pair per entry that calls the entry's method with
     the entry's args and embeds a head() preview of the result

Notebooks regenerate inside `evaluate_entries` when:
  - the file is missing (bootstrap), OR
  - any entry just had a writethrough, OR
  - --update-snapshot / --overwrite-snapshot was passed (data fresh)

For deterministic git diffs:
  - cell IDs derived from md5(source) — same content → same ID
  - execution_count = None (no run-counter churn)
  - result preview is head().to_string() — bounded, plaintext
  - JSON serialized with stable indent + key order

Result: a runnable companion notebook per flavor whose diff flips
only when SQL or data actually changes.
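
The determinism tricks boil down to a stable cell constructor. A stdlib-only
sketch of the on-disk JSON shape (the real builder presumably goes through
nbformat; the field layout below follows the nbformat 4.5 cell schema):

```python
import hashlib
import json

def make_code_cell(source: str) -> dict:
    return {
        "cell_type": "code",
        # Same content -> same ID, so regeneration never churns the diff.
        "id": hashlib.md5(source.encode()).hexdigest()[:8],
        "execution_count": None,   # no run-counter churn
        "metadata": {},
        "outputs": [],
        "source": source.splitlines(keepends=True),
    }

cell = make_code_cell("df = bsq.agg.aggregate_annual(enduses=['fuel_use'])\n"
                      "df.head()")
# Stable indent + sorted keys => byte-identical serialization per content.
print(json.dumps(cell, indent=1, sort_keys=True)[:60])
```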

The generated BuildStockQuery in each notebook now passes
`cache_folder=tests/query_snapshots/<schema>_cache`, so query calls in
the notebook reuse the parquets the test suite already downloaded.
First-run cost drops to ~$0 — Athena returns cached results without
scanning data.

Resolves the cache path with a Path.cwd()-based fallback chain so the
notebook works whether Jupyter was launched from the repo root, the
tests/ directory, or tests/example_notebooks/ itself.

Previously the notebook used Path.cwd() with a 3-way fallback chain to
find the snapshot test cache directory. That worked from the repo root,
tests/, or tests/example_notebooks/, but could fail if Jupyter was
launched from a deeper or unrelated directory.

Use IPython's `_dh[0]` (the notebook's own directory at kernel startup)
to resolve the cache as a relative sibling: `../query_snapshots/<schema>_cache`.
Falls back to "." if `_dh` is missing (e.g. when the cell is run outside
Jupyter), preserving backward compatibility with non-Jupyter execution.

Verified end-to-end via `jupyter nbconvert --execute` on
annual_resstock_oedi.ipynb — 13 cells executed, all queries returned
cached results with $0.00 cost.
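
The resolution in the generated cell amounts to a couple of lines. A sketch:
`_dh` is IPython's directory-history global (index 0 is the notebook's own
directory at kernel startup); the schema name in the path is illustrative.

```python
from pathlib import Path

# "." fallback keeps the cell runnable outside Jupyter, where _dh is absent.
nb_dir = Path(globals().get("_dh", ["."])[0])
cache_folder = nb_dir / ".." / "query_snapshots" / "resstock_oedi_cache"
print(cache_folder)
```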

Generated notebooks now run through nbclient after writing so cells
contain real DataFrame outputs (rich text/html tables + text/plain)
instead of a pre-embedded preview. The notebook IS the runnable artifact
AND the source of truth for current behavior.

Two safety mechanisms keep regen cheap and bounded:

  1. `_NO_EXECUTE_METHODS` allowlist — `report.get_success_report`,
     `report.check_ts_bs_integrity`, and `agg.get_building_average_kws_at`
     are commented out in source. Each fans out internally to many
     Athena queries the snapshot cache doesn't cover (count-distinct
     per upgrade, per-pair change tracking, the 3.2 TB landmine
     documented in CLAUDE.md). Determined by reading method source —
     only true singleton-execute methods run live.

  2. `_has_unroundtrippable_placeholder` — auto-comments any cell whose
     args contain SA object reprs (`<sqlalchemy....Label at 0x...>`,
     `MappedColumn(bsq=<...object at 0x...>)`, or our own `<Class ...>`
     fallback). Those can't run as written; the user must construct
     the SA object inline.
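
The placeholder scan can be approximated with a single regex. An illustrative
sketch only; the real helper's patterns may differ, and the example strings
are invented.

```python
import re

_PLACEHOLDER_RE = re.compile(
    r"<sqlalchemy\.\S* at 0x[0-9a-f]+>"   # SA object reprs, e.g. Label
    r"|MappedColumn\(.*object at 0x"      # wrapper holding a live object
    r"|<[A-Za-z_][\w.]* [^>]*>"           # generic <Class ...> fallback
)

def has_unroundtrippable_placeholder(src: str) -> bool:
    """True when generated source embeds a repr that can't be re-executed."""
    return bool(_PLACEHOLDER_RE.search(src))

print(has_unroundtrippable_placeholder(
    "agg(enduses=[<sqlalchemy.sql.elements.Label at 0x7f3c2b>])"))  # True
print(has_unroundtrippable_placeholder(
    "bsq.agg.aggregate_annual(enduses=['fuel_use'])"))              # False
```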

`allow_errors=True` on the nbclient executor: a single bad cell embeds
its error in the notebook instead of aborting execution. This catches
pre-existing interactive-mode bugs in `bsq.get_results_csv` etc. that
the snapshot tests miss (they run with get_query_only=True, never
exercising the post-query DataFrame access).

End state: 121 cells executed cleanly, 8 commented (unsafe), 10
commented (placeholder), 6 cells with embedded errors documenting
real bugs to fix separately.

Removes unused imports (`os` in main.py and helpers.py, `sqltypes` in
main.py) and a stale `ucol` local in aggregate_query.py; also adds a
missing blank line before a nested def and wraps a 124-char comment
line. The CI Lint step was failing on these six pyflakes/pycodestyle
violations.
@rajeee rajeee changed the title uv-native packaging migration (v0.4.0) Switch to OEDI Format Apr 28, 2026
rajeee added 7 commits April 28, 2026 12:42
The bs_per_bldg subquery (introduced to collapse ComStock's tract fan-out
before joining ts_aggr) was wrapping bs-side group_by columns in
arbitrary() and grouping only on (bldg_id, state). For ComStock — whose
metadata is partitioned at (bldg_id, tract, state) granularity with
weight divided across tract rows — a building straddling counties had
its full weight attributed to whichever county arbitrary() picked,
silently dropping the tract-fractional disaggregation.

Same issue silently affected ResStock's utility_ts_by_eiaid: eiaid_weights
fans out per-county into multiple utilities, so arbitrary(eiaid) collapsed
the weighted apportionment.

Fix: project bs-side group_by columns as bare references and add their
underlying expressions to bs_per_bldg's GROUP BY. Pure-bs enduses
(sqft, vintage) and extra_bs_cols stay under arbitrary() — they're
true per-bldg constants.
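
A toy illustration of the weight bug, with invented numbers: one building
straddling two counties, its weight split across tract rows.

```python
rows = [  # (bldg_id, county, weight) -- tract-fractional metadata rows
    (101, "Denver", 0.4),
    (101, "Jefferson", 0.6),
]

# arbitrary()-style collapse: one county picked per building, the full
# weight attributed to it -- the fractional disaggregation is lost.
picked_county = rows[0][1]
arbitrary_by_county = {picked_county: sum(w for _, _, w in rows)}
print(arbitrary_by_county)            # {'Denver': 1.0}

# Fix: key the per-building pre-aggregation on the group-by column too,
# so each county keeps its fractional share.
by_county: dict[str, float] = {}
for _, county, w in rows:
    by_county[county] = by_county.get(county, 0.0) + w
print(by_county)                      # {'Denver': 0.4, 'Jefferson': 0.6}
```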

Adds three ComStock-only timeseries snapshots (year-collapse: by_county,
by_tract, overall_co) plus an invariant cross-checking sum(by-county)
== sum(by-tract) == overall. Replaces utility_ts_by_eiaid's eiaid=4110
(no CO data → empty) with eiaid=15466 (Xcel CO, ~9.3k samples) so the
snapshot exercises the disaggregation path with real data.

SqlCache now appends each get/put hash to .cache_usage_log inside the
cache folder so we can tell which entries a test session actually
exercised. tests/cleanup_stale_caches.py removes cache files whose
hashes are referenced in neither the snapshot JSONs nor the usage log;
tests/refresh_caches.py wraps the clear → run-tests → cleanup flow.

Also includes the stale entries that the first cleanup run identified
and dropped, freeing the resstock_oedi_cache + comstock_oedi_cache
directories of orphans left over from previous SQL refactors.
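
The usage-log append is a one-liner per access. A stdlib-only sketch; the
commit message doesn't show SqlCache's real method names, so the function
here is illustrative.

```python
import tempfile
from pathlib import Path

def log_cache_use(cache_folder: Path, query_hash: str) -> None:
    """Append each get/put hash so a later cleanup pass can tell which
    cache entries a test session actually exercised."""
    log = cache_folder / ".cache_usage_log"
    with log.open("a") as f:
        f.write(query_hash + "\n")

with tempfile.TemporaryDirectory() as d:
    log_cache_use(Path(d), "ab12cd34")
    log_cache_use(Path(d), "ef56ab78")
    print((Path(d) / ".cache_usage_log").read_text().splitlines())
    # ['ab12cd34', 'ef56ab78']
```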

Timestamps and a few result-data fields refreshed from the most recent
test session that exercised every notebook against the pruned cache.

Extends the test suite to cover the county-aggregated ComStock metadata
table (`_md_agg_by_state_and_county_parquet`) where buildings within a
county are pre-collapsed into one weight-correct row. The agg schema is
wired into every schema-aware list/dispatcher (fixtures, snapshot and
invariant parametrize lists, placeholder resolver, cleanup script,
notebook builder, backfill costs) and the snapshot harness's
`_patch_hash` now inserts a missing schema key in addition to overwriting
an existing one. `--update-snapshot` populated the new
`comstock_oedi_agg_cache/` with 76 sidecar triples covering 66 JSON
entries.

Walks every snapshot JSON entry across six weight-aggregating flavors
(annual, timeseries, savings, applied_only, restrict_avoid,
invariants_three_way), runs the query on both ComStock schemas, and
asserts equal DataFrames after dropping `sample_count` (the only column
that differs because the agg table has fewer rows). Filters out
nondeterministic entries and aggregations that don't survive county
pre-collapse (mean/max/min/quartiles/arbitrary).

A separate test handles `get_building_ids`: the two schemas return
different partition columns (tract vs county), so the comparison
projects to (bldg_id, state) and dedupes — the resulting building set
is identical across schemas. Both tests run from the snapshot caches in
under 6 seconds with no Athena spend.
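
Why only sum-style aggregations survive county pre-collapse can be shown with
toy numbers (invented for this sketch): the agg schema stores one
weight-correct row per county, so a county sum is preserved, but building-level
max/min/mean are not recoverable.

```python
raw = [("Denver", 2.0), ("Denver", 4.0), ("Jefferson", 3.0)]  # per-bldg rows
agg = [("Denver", 6.0, 2), ("Jefferson", 3.0, 1)]             # (county, sum, sample_count)

raw_sum = sum(v for _, v in raw)
agg_sum = sum(s for _, s, _ in agg)
print(raw_sum == agg_sum)        # True  -> comparable across schemas

raw_max = max(v for _, v in raw)  # per-building max: 4.0
agg_max = max(s for _, s, _ in agg)  # county-row max: 6.0, not a building
print(raw_max == agg_max)        # False -> filtered out of the comparison
```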

Adds the three-way invariant on the applied_in=[1,2] code path that
combines flows the existing tests covered separately but never together:
annual ≡ ts-year-collapse ≡ sum(ts-monthly), now exercised on both the
categorical comstock_building_type axis and the county partition-key
axis whose arbitrary()-collapse silently broke before commit 182ff21.

Three new snapshot entries (annual / ts_year / ts_monthly grouped by
county, applied_only=true, applied_in=[1,2], CO) populate the cache for
both ComStock schemas, gated to comstock_oedi/comstock_oedi_agg via the
JSON `schemas` allowlist since resstock has no county column.
@rajeee commented May 4, 2026

Superseded by: #75

@rajeee rajeee closed this May 4, 2026