Switch to OEDI Format #74

Closed
rajeee wants to merge 113 commits into main from new_sampling

Conversation


@rajeee rajeee commented Apr 23, 2026

Summary

Permanently drops support for the legacy 3-table BSB schema (separate baseline / upgrades / timeseries tables) in favor of the OEDI 2-table format (annual_and_metadata + timeseries). This is the dominant theme of the PR and a hard breaking change — there is no fallback path for the old shape. All query-generation, snapshot fixtures, integrity checks, and tests now assume the OEDI shape.

Version bump: 0.3.0 → 0.4.0.

Schema pivot (the breaking change)

  • New 2-table model: md_table (annual + metadata, replaces baseline-and-upgrades) and ts_table (timeseries). The bs_table / up_table handles are now thin SA aliases over md_table, kept where two distinguishable handles are needed for self-joins.
  • Renames at the QueryCore level: bs_table/up_table → md_table, bs_key/up_key → md_key, bs_bldgid_column → md_bldgid_column, _bs_completed_status_col/_up_completed_status_col → _md_completed_status_col, _up_upgrade_col → _md_upgrade_col, _restrict_targets_bs → _restrict_targets_md, plus matching helpers.
  • Constructor signature changes: the legacy 3-tuple (db, baseline, upgrades) form is gone — table_name is now str | tuple[str, Optional[str]].
  • TS upgrade-pair queries rewritten to a single-scan MAX(CASE WHEN upgrade=…) + GROUP BY shape (replaces the previous baseline/upgrade self-join), with a CTE → subquery fallback for SQLAlchemy 2.0 literal-bind expansion.
  • upgrade=0 filter now threaded through baseline-side joins; baseline rows filtered out of upgrade-side report queries.
  • Multi-column / composite unique-key support for OEDI tables (e.g. ComStock tract-denormalized metadata): unique_keys schema, multi-column restrict keys, and full unique-key tuples returned from building-id helpers.
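The single-scan upgrade-pair shape can be sketched on a toy table. This is a hedged illustration only: the table and column names (ts, bldg_id, upgrade, kwh) are stand-ins, not the library's actual OEDI schema, and the real query pivots at timestamp grain with more columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, upgrade INTEGER, kwh REAL);
    INSERT INTO ts VALUES (1, 0, 10.0), (1, 1, 7.0),
                          (2, 0, 20.0), (2, 1, 16.0);
""")
# One scan over ts, pivoted with MAX(CASE WHEN upgrade = ...) per side,
# instead of a baseline/upgrade self-join.
rows = con.execute("""
    SELECT bldg_id,
           MAX(CASE WHEN upgrade = 0 THEN kwh END) AS bs_kwh,
           MAX(CASE WHEN upgrade = 1 THEN kwh END) AS up_kwh
    FROM ts
    GROUP BY bldg_id
    ORDER BY bldg_id
""").fetchall()
print(rows)  # [(1, 10.0, 7.0), (2, 20.0, 16.0)]
```

Each building's baseline and upgrade values land on one output row after a single pass over the table, which is what makes the self-join unnecessary.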

Performance work enabled by the pivot

  • TS aggregation unified: pre-bucket the inner GROUP BY, defer pure-bs enduses to the outer SELECT (constant per building, no need to materialize per-15-min and re-aggregate).
  • Pre-aggregate bs to building grain before TS join to eliminate tract fan-out on ComStock.
  • Fold utility join_list joins into the bs_per_bldg subquery (an interim fallback to the direct ts ⋈ bs JOIN shape for join_list queries was later replaced by this fold).
  • Skip inner timestamp materialization for year-collapse TS queries.
  • Push user bs_restrict into the inner ts ⋈ bs JOIN condition so Athena predicate-pushes into the metadata scan (minutes vs. timeouts on ComStock).
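The tract fan-out fix above can be sketched on toy tables. This is an illustrative sketch, not the library's SQL: the names (md, ts, tract, weight) are assumptions, and the real bs_per_bldg subquery also projects characteristic columns via arbitrary().

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE md (bldg_id INTEGER, tract TEXT, weight REAL);
    CREATE TABLE ts (bldg_id INTEGER, kwh REAL);
    -- building 1 is tract-denormalized: two metadata rows
    INSERT INTO md VALUES (1, 'A', 2.0), (1, 'B', 2.0), (2, 'C', 3.0);
    INSERT INTO ts VALUES (1, 10.0), (2, 20.0);
""")
# Direct ts ⋈ md join: building 1 fans out by its 2 tract rows.
naive_rows = con.execute(
    "SELECT COUNT(*) FROM ts JOIN md USING (bldg_id)").fetchone()[0]
naive = con.execute(
    "SELECT SUM(ts.kwh * md.weight) FROM ts JOIN md USING (bldg_id)"
).fetchone()[0]
# Pre-aggregate md to building grain first: weight sums are linear, so the
# total is unchanged while the joined rowset shrinks.
fixed = con.execute("""
    SELECT SUM(ts.kwh * b.bldg_weight)
    FROM ts JOIN (SELECT bldg_id, SUM(weight) AS bldg_weight
                  FROM md GROUP BY bldg_id) AS b USING (bldg_id)
""").fetchall()[0][0]
print(naive_rows, naive, fixed)  # 3 100.0 100.0
```

The totals agree because all bs-side aggregations are linear in weight; the win is purely in the size of the joined rowset (3 rows vs 2 here, billions vs millions on national ComStock queries per the description below).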

Test infrastructure rebuild

  • Replaced legacy test_BuildStockQuery.py / test_BuildStockSavings.py / test_query.py with snapshot-driven test_query_snapshots.py and a 2102-line test_invariants.py that cross-checks independent code paths (annual vs ts_year_collapse vs sum(ts_monthly), savings additivity, applied_in semantics, etc.).
  • Content-addressed disk cache replaces the old pickle cache; SQL is normalized at compile time; cost-history lookup + cost-regression gate guard against runaway Athena spend.
  • Snapshot fixtures regenerated for the OEDI shape across aggregates_and_savings, restrict_avoid, savings, timeseries, utility, and helper-method snapshots.
  • Example notebooks auto-generated from snapshot entries and executed via nbclient with safety gating.

Packaging + tooling (rides along)

  • Migrated from setup.py to PEP 621 pyproject.toml with hatchling as the build backend.
  • Dev deps moved to PEP 735 [dependency-groups] (install via uv sync --group dev); [project.optional-dependencies.full] retained for consumers.
  • Python floor raised to 3.12 (requires-python = ">=3.12").
  • uv.lock committed for reproducible dev/CI environments; CI switched to uv and to a 3.12/3.13/3.14 matrix.
  • Lint: kept flake8 (ruff was tried and reverted for now); --update-snapshot semantics tightened so snapshots can't paper over real numeric divergence.

Bug fixes uncovered along the way

  • aggregate_ts_by_eiaid was binding the eiaid column to md_table instead of bs_table after the collapse.
  • Several comma-join bugs in check_ts_bs_integrity, get_distinct_count, and select-from-md_table call sites (DUPLICATE_COLUMN_NAME on Athena).
  • applied_only filter on TS query paths was dropping rows it shouldn't have (validated by an xfail invariant flipping green).
  • get_upgrades_csv cross-join, get_upgrade_names malformed SQL, get_successful_simulation_count crash, viz_data monthly-results query path, and several composite-key handling regressions.
  • _restrict_targets_md now recognizes bare unprefixed column names.
  • TS pivot crash on calculated columns; get_quartiles correctly disallowed on TS queries.

Test plan

  • uv sync --group dev resolves cleanly on Python 3.12 / 3.13 / 3.14
  • pytest -vv passes (snapshot tests + invariants)
  • flake8 buildstock_query is clean
  • Smoke-test BuildStockQuery against a real OEDI ResStock + ComStock workspace (annual, savings, TS, multi-state, composite-key)
  • Run example notebooks end-to-end against the cached snapshot fixtures

Migration notes for consumers

  • Anyone passing table_name=(db, baseline, upgrades) must switch to a single OEDI annual_and_metadata table name (or (db, md, ts)).
  • Anyone reading bsq.bs_table / bsq.up_table / bsq.bs_key will continue to work via the alias shims, but the canonical attribute is now bsq.md_table / bsq.md_key.
  • The 3-table BSB workspace shape is not supported on 0.4.0 — pin to 0.3.x if you still need it.

@rajeee rajeee changed the title from uv-native packaging + applied_in features (v0.4.0) to uv-native packaging migration (v0.4.0) on Apr 23, 2026
rajeee and others added 8 commits April 22, 2026 21:11
New [unique_keys] TOML block with metadata and timeseries key lists
(timeseries must be a subset of metadata). Three new schemas ship with
it: comstock_oedi_state_and_county, resstock_oedi_new, resstock_oedi_vu.

bs_key, up_key, ts_key tuples (plus *_key_cols properties) expose the
configured unique-key identity of each table. Aggregate sample counts,
load-duration grouping, and timesteps verification now use these tuples
via _count_distinct() and multi-column group-by. Single-key schemas
still emit byte-identical SQL.

Also carries prior session fixes: CACHED sentinel skipped in batch
helpers, cost log precision bumped to 2 decimals, and _get_rows_per_building
groups on the configured timeseries keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
_get_applied_in_subquery now projects all metadata key columns, and
_add_applied_in_restrict emits a tuple-IN filter against the correct
side (ts_key when timeseries-backed, bs_key otherwise). Single-key
schemas still emit the original single-column IN form.

_get_restrict_clauses and _normalize_restrict_subquery extended to
accept a tuple of columns paired with a multi-width subquery, using
sa.tuple_(c1,...).in_(subquery) under the hood.

Also fix an outdented return in _get_buildings_by_change that broke
the module at import time under strict linting.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
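The tuple-IN shape that sa.tuple_(c1, ...).in_(...) compiles to can be sketched directly in SQL. A hedged toy example (table and key names are illustrative, and a literal VALUES list stands in for the multi-width subquery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE md (bldg_id INTEGER, state TEXT, vintage TEXT);
    INSERT INTO md VALUES (1, 'CO', '1980s'), (2, 'CO', '1990s'),
                          (3, 'TX', '1980s');
""")
# Multi-column restrict: a (bldg_id, state) tuple matched against a
# two-column row set.
rows = con.execute("""
    SELECT bldg_id FROM md
    WHERE (bldg_id, state) IN (VALUES (1, 'CO'), (3, 'TX'))
    ORDER BY bldg_id
""").fetchall()
print(rows)  # [(1,), (3,)]
```

A single-key schema degenerates to the familiar scalar `bldg_id IN (...)` form, which is why single-key SQL stays byte-identical.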
get_building_ids and get_buildings_by_locations now return DataFrames
projecting all bs_key columns instead of just bldg_id. _get_upgrade_buildings
and get_buildings_by_change return list[int] for single-key schemas and
list[tuple] for multi-key. _get_full_options_report's array_agg uses
row(...) packing when the upgrade key has more than one column.

Single-bldg_id default paths keep byte-identical SQL and return shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
get_results_csv, get_results_csv_full, get_upgrades_csv, and
get_upgrades_csv_full now return flat DataFrames. Parquet readbacks
that still carry a bldg_id index are reset so the returned shape is
consistent regardless of source.

Breaking change: callers that relied on df.loc[bldg_id] must now do
df[df.bldg_id == bldg_id] or set the index themselves.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Old check compared row counts between timeseries and baseline/upgrade
tables, which produced false positives whenever a bldg_id could
legitimately appear in baseline more than once (e.g. per state).

New flow:
  1. Distinct-bldg parity per upgrade between timeseries and baseline/upgrade
     tables, using bs_key and ts_key for distinct counting. A dedicated
     _get_metadata_distinct_bldg_count replaces the row-counting derivation
     from get_success_report.
  2. Duplicate detection per table: group by bs_key on baseline, up_key +
     upgrade on the upgrade table, and ts_key + timestamp (+ upgrade when
     present) on the timeseries table. First 5 offending rows are printed.
  3. Existing per-building row-count verification retained, now phrased in
     terms of partitions rather than buildings.

_get_ts_report now counts distinct ts_key tuples instead of distinct bldg_id.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
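The duplicate-detection step above can be sketched as a GROUP BY over the full unique key with a HAVING filter. A hedged toy version (names are illustrative; the real check groups on the configured ts_key plus timestamp and, when present, upgrade):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, state TEXT, stamp TEXT, kwh REAL);
    INSERT INTO ts VALUES (1, 'CO', '00:15', 1.0), (1, 'CO', '00:15', 1.0),
                          (2, 'CO', '00:15', 2.0);
""")
# Group by the full key; any group with more than one row is a duplicate.
# LIMIT 5 mirrors "first 5 offending rows are printed".
dupes = con.execute("""
    SELECT bldg_id, state, stamp, COUNT(*) AS n
    FROM ts GROUP BY bldg_id, state, stamp
    HAVING COUNT(*) > 1
    LIMIT 5
""").fetchall()
print(dupes)  # [(1, 'CO', '00:15', 2)]
```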
Near-duplicate of _get_rows_per_building with no callers in the
package. Drop it rather than maintain two versions through the
multi-key refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- bs_key / up_key / ts_key reflect configured unique_keys and revert
  to (bldg_id,) when cleared.
- applied_in emits tuple-IN for multi-key schemas and scalar-IN for
  single-key.
- aggregate_timeseries with timestamp_grouping_func emits
  count(DISTINCT (c1, c2)) for multi-key and count(distinct(c1)) for
  single-key.
- get_building_ids projects all bs_key columns.

Also switch _count_distinct to sa.distinct(sa.tuple_(*cols)) so the
multi-column form renders as the standard SQL `count(DISTINCT (c1, c2))`
rather than `count(distinct(c1, c2))`, which Trino parses as a
single-arg function call on a row tuple.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
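Trino accepts the standard `count(DISTINCT (c1, c2))` form directly. SQLite has no multi-column distinct aggregate, so this hedged sketch expresses the same distinct-tuple count with a DISTINCT subquery (toy names, not the library's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, state TEXT);
    INSERT INTO ts VALUES (1, 'CO'), (1, 'CO'), (1, 'TX'), (2, 'CO');
""")
# Portable equivalent of count(DISTINCT (bldg_id, state)): count the rows
# of a DISTINCT two-column subquery.
n = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT bldg_id, state FROM ts)"
).fetchone()[0]
print(n)  # 3
```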
rajeee added 18 commits April 23, 2026 00:18
The branch earlier swapped flake8 for ruff with an aggressive ruleset
(E, W, F, I, UP, B) that flagged ~280 pre-existing UP/B violations
unrelated to unique_keys work. Revert to flake8 with the same config
used on main (.flake8, max-line-length = 120) so this PR stays focused
on multi-key unique-keys changes.

Also wrap two 137-char lines in _get_metadata_distinct_bldg_count so
flake8 passes cleanly.
UNLOAD writes query results as multiple parquet files, and row order
across chunks is not guaranteed when read back. test_savings_shape_with_timestamp_grouping
was intermittently failing because the time column started at different
points depending on which parquet chunk pandas read first.

Sort df1 and df2 by (geometry_building_type_recs, time) before comparing
so the assertion is order-independent.
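The fix pattern can be sketched without parquet at all: sort both sides on the comparison keys before asserting equality. The row values here are made up for illustration; the real test sorts pandas frames on (geometry_building_type_recs, time).

```python
# Chunked parquet readback can interleave rows arbitrarily; sorting both
# sides on the grouping keys makes the equality check order-independent.
chunk_order_a = [("SFD", "2018-02", 5.0), ("SFA", "2018-01", 3.0)]
chunk_order_b = [("SFA", "2018-01", 3.0), ("SFD", "2018-02", 5.0)]


def key(row):
    return (row[0], row[1])  # (building type, time)


same = sorted(chunk_order_a, key=key) == sorted(chunk_order_b, key=key)
print(same)  # True
```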
Set the pandas index back on the four CSV-returning functions to
preserve backward compatibility with callers that rely on
df.loc[bldg_id]. For single-key schemas the index is a single-level
Index on bldg_id (byte-identical to main). For multi-key schemas it
becomes a pandas MultiIndex on the configured bs_key / up_key tuple.

get_results_csv, get_results_csv_full, get_upgrades_csv, and
get_upgrades_csv_full all use `set_index(list(self.bs_key))` (or up_key).
Full variants reset any existing index from the parquet before re-
indexing so the result is consistent regardless of how the parquet
was written.
Fixes ValueError when querying timeseries data against schemas whose
timeseries unique key spans multiple columns (e.g. bldg_id + state).
_add_applied_in_restrict now threads a width-matched subquery to the
correct side, _split_restrict routes tuple column refs to the ts side,
and restrict/avoid accept (col_tuple, subquery | list-of-tuples).
bs_tbl in this branch is a ts-side subquery whose columns match the
timeseries unique key, not the metadata key. Looking up bs_key on it
raises KeyError for schemas whose metadata key has cols absent from ts
(e.g. in.nhgis_tract_gisjoin). Surfaced by a monthly TS query against
the comstock_oedi_state_and_county schema.
Clears the deprecated query surface ahead of the 2.0 release so the new
bsq.query(...) facade is the single public path for annual, timeseries,
and savings aggregations. Internal callers (utility_query, viz_data) are
migrated, savings_query is deleted outright, and the savings_shape-compat
branches in aggregate_query._query collapse to single canonical paths.

Also drops TSQuery.split_enduses (application code should batch) and
TSQuery.collapse_ts (use timestamp_grouping_func="year"); SavingsQuery
schema is removed.
The else-branch in init_monthly_results referenced run_obj._conn, which
does not exist on BuildStockQuery, and used run_obj.params instead of
run_obj.run_params for the S3 unload bucket. Both were latent bugs that
only fired on cache miss; the prior aggregate_timeseries SQL happened to
match cached entries, so the branch never ran in CI.

Replace the hand-rolled unload/cache dance with a direct run_obj.query()
call, which routes through execute() and handles caching, S3 result
lookup, and UNLOAD submission consistently with the rest of the library.

Note: test_Viz fixtures may need their reference pkl cache regenerated
against the new SQL text; that is a separate fixture-maintenance task.
When a timeseries query asks only for upgrade output (no savings, no
baseline) with applied_only=True, the ts_b/ts_u subquery pairing adds
unnecessary cost and, more importantly, returns zero rows for runs whose
timeseries table lacks upgrade=0 rows (upgrade-only runs). The column
expressions for applied_only reference only up_tbl, so we can aggregate
ts_u directly with the same shape the upgrade_id="0" branch already uses.

Also refresh the viz test reference caches with entries keyed off the new
SQL text.
Relocate the pre-snapshot test modules (test_BuildStockQuery.py,
test_BuildStockUtility.py, test_query.py, test_Viz.py, and their helpers)
into tests/legacy/ so pytest stops collecting them. The new snapshot-driven
framework replaces them and the old files are kept for reference.
Introduce a test framework that pins BuildStockQuery.query() generated SQL
and reference result data per curated argument set, plus cross-query
invariant checks that verify internal consistency without hitting Athena.

Framework (tests/test_utility.py, tests/conftest.py,
tests/test_query_snapshots.py):
  - Each snapshot entry in tests/query_snapshots/{schema}/*.json carries
    one or more equivalent arg-variants, a .sql file (exact generated
    SQL), and a .parquet file (reference DataFrame).
  - Normalized-whitespace SQL comparison with sqlglot fallback; variant
    divergence surfaces as per-variant failure blocks.
  - Pytest options --check-data, --update-sql, --update-sql-and-data
    control data comparison and (re)generation of reference files.
  - Session-scoped fixtures for comstock_oedi_state_and_county and
    resstock_oedi_vu schemas against buildstock_sdr on rescore.

Invariants (tests/test_invariants.py):
  - annual == ts-year-collapse == sum(ts-monthly) for enduse totals.
  - sample_count / units_count consistency across query flavors (resstock
    only; comstock counts intentionally skipped pending a library fix).
  - baseline - upgrade ~= savings within a single include_savings query.

Seeded with ~30 entries across annual/timeseries/savings/applied_only/
restrict_avoid/mapped_column for both schemas, CO-restricted and limited
to electricity/natural-gas totals to keep Athena usage small.

See tests/query_snapshots/README.md for the workflow.
_add_restrict already accepts annual_only to route restrict columns
through the baseline vs timeseries table, but the caller in
BuildStockAggregate._query was never passing it, while the adjacent
_add_avoid call was. Bring restrict into line with avoid so annual-only
queries consistently resolve restrict columns against the baseline
table.
On schemas where one TS row maps to multiple metadata rows (e.g.
comstock_oedi_state_and_county, which has one metadata row per
(bldg_id, nhgis_tract_gisjoin, state) but one TS row per (bldg_id,
state)), the TS-grouping branch of BuildStockAggregate._query was
returning sample_count values that undercounted by the per-TS-key
metadata multiplicity — empirically 1.4x-100x across building types.

Count distinct over the full metadata key on the baseline table
(already present in tbljoin) instead of over the TS key on the TS
subquery. For schemas where bs_key == ts_key (resstock_oedi_vu etc.)
the resulting count value is identical; SQL differs because columns
are now referenced via bs_table.

Comstock TS-grouping snapshot data needs regeneration after this
change:

    pytest -s -v tests/test_query_snapshots.py --update-sql-and-data

Re-enable the sample_count/units_count check in the comstock invariant
test once snapshots are refreshed.
The library bug that made these undercount on comstock TS-grouping queries
was fixed in 25fa3fa. Restore the active count-column comparisons to guard
against regression. Now 4/4 invariants pass including the comstock counts.
Only _get_gcol (the group_by path) used to fall back to the prefixed
column name when a bare name didn't resolve. Restrict, avoid, and the
TS side of _split_restrict went straight to _get_column and raised
ValueError on bare names like "state" / "vintage" whenever the
underlying table only exposed the prefixed form ("in.state" etc).

Move the fallback into QueryCore._get_column itself so every caller
that resolves a user-provided string column gets the same behavior.
Simplify _get_gcol to drop its now-redundant except block.

Also trim trailing-newline inconsistency in
comstock_oedi_state_and_county.toml.
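The fallback described above can be sketched in a few lines. This is a hypothetical stand-in for QueryCore._get_column (the helper name and prefix tuple are illustrative; the real code resolves SQLAlchemy column objects using the schema's configured _char_prefix/_out_prefix):

```python
def resolve_column(name: str, table_columns: set[str],
                   prefixes: tuple[str, ...] = ("in.", "out.")) -> str:
    """Resolve a bare user-provided name, retrying with each configured
    prefix before giving up."""
    if name in table_columns:
        return name
    for prefix in prefixes:
        if prefix + name in table_columns:
            return prefix + name
    raise ValueError(f"Column {name!r} not found")


cols = {"in.state", "in.vintage", "out.site_energy.total.energy_consumption"}
print(resolve_column("state", cols))  # in.state
```

Putting the fallback in one place means restrict, avoid, group_by, and the TS side of _split_restrict all accept bare names like "state" identically.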
Now that _get_column auto-prepends the characteristics prefix, the
snapshot JSONs no longer need per-schema hand-prefixing for restrict
and avoid. Simplify to a uniform convention:

  - group_by: write "in.<col>" explicitly (readable, consistent across
    schemas).
  - restrict / avoid: write bare "state" / "vintage" — the library's
    auto-prefix fallback resolves to "in.state" / "in.vintage" when
    that's what exists on the table.

Comstock group_by entries gain their "in." prefix; resstock
restrict/avoid entries drop theirs. Stored SQL is unchanged because
the resolved SQLAlchemy column objects are the same.
Now that QueryCore._get_column auto-prepends the characteristics prefix
for both group_by and restrict/avoid, snapshot JSON entries can rely on
bare column names everywhere. Stored .sql files are unchanged — the
fallback resolves bare "state" / "geometry_building_type_recs" /
"comstock_building_type" / "vintage" to the same SQLAlchemy column
objects that were previously written out explicitly.
Collapse four near-duplicate test functions into two parametrized ones:

  test_annual_equals_ts_year_equals_ts_monthly_sum[resstock_...]
  test_annual_equals_ts_year_equals_ts_monthly_sum[comstock_...]
  test_savings_decomposition[resstock_...]
  test_savings_decomposition[comstock_...]

Each case provides the fixture name, schema directory, per-leg enduse
lists (comstock baseline uses "..kwh" suffix, TS does not), and the
group-by column. The bsq fixture is resolved at test time via
request.getfixturevalue so the case table stays declarative.

Net: -48 lines, same 4 cases, no behavior change (15/15 pass).
rajeee added 20 commits April 27, 2026 11:07
These queries fire on every BuildStockQuery(skip_reports=False) init
and have already produced two comma-join regressions. Pinning their
SQL shape in a snapshot catches future re-introductions automatically.

Refactor: split SA query construction out of _get_metadata_distinct_bldg_count
and _find_duplicate_rows into _build_*_queries* helpers so the snapshot
harness can compile without executing. Add get_query_only kwarg to
check_ts_bs_integrity returning the full list of compiled SQL strings
in execution order.

Mark the entry nondeterministic so the data check never runs —
_get_rows_per_building scans the entire TS table and would be a multi-TB
landmine if the harness tried to validate data on a hash refresh.
The TS upgrade-pair pivot path looked up `ts.c[e.name]` for each enduse
when building the per-side CASE expressions. That works for bare ts
columns but raises KeyError on Labels produced by `get_calculated_column`,
since `e.name` is the user-provided label (e.g. "elec_minus_gas") rather
than a real ts column.

Wrap `e.element` (the underlying arithmetic expression) in CASE for
Labels and the column directly otherwise. The expression already
references raw ts columns, so the CASE correctly gates the whole
arithmetic per upgrade side and the result keeps its `bs__<name>` /
`up__<name>` label for the outer _SideView lookup.

The original `calculated_column_elec_minus_gas` snapshot only exercised
table='baseline' + annual_only=True + upgrade_id=0 — none of the three
dimensions needed to enter the broken branch. Add two TS pivot variants
(applied_only=False to force upgrade-pair pivot, and applied_only=True +
include_baseline=True to additionally exercise applied_in synthesis) so
this regression class is caught next time. Both restrict to state=CO to
keep the TS scan bounded.

Also: route `placeholder_annual` through the calc-column test's
expression substitution so TS-table calculated columns resolve to the
suffix-less ts column names on comstock.
Old shape (single-scan + SUM-CASE pivot, GROUP BY raw timestamp):
- Inner GROUP BY at (bldg, 15-min, state) granularity → billions of groups
  on national-scope hourly queries → Athena timeouts at 30 min.
- SUM(CASE WHEN ...) puts arithmetic INSIDE the aggregate phase, defeating
  scan-time projection pushdown.

New shape (two-level subquery + FILTER aggregates, GROUP BY bucketed time):
- Innermost ts_flat subquery materializes per-row enduse arithmetic as a
  scalar before any aggregation — Trino can push the projection down to
  the scan layer.
- Outer ts_pivot subquery uses FILTER (WHERE upgrade=N) per side and
  GROUPs BY at the user's time granularity (date_trunc pre-applied):
  4× fewer keys for hourly, 720× fewer for monthly, 35,000× fewer for yearly.
- Per-bldg granularity preserved for correct weight propagation through
  the outer JOIN to bs.
- ComStock composite primary key (bldg_id, state) handled correctly via
  ts_unique_keys flowing through every layer.

Carries forward _inner_rows = count(*) FILTER (WHERE upgrade=0) so the
outer rows_per_sample metric (which measures underlying 15-min cadence)
stays correct after pre-bucketing. units_count is self-correcting.

Also fixes a pre-existing calc-col bug in the annual flow: get_col(up_tbl,
col) for a Label fell through to returning the bs-bound expression.
Annual upgrade!=0 + applied_only=False with a calc col enduse silently
returned BASELINE values. Fix: rebind_to(col, target_tbl) using SA's
ClauseAdapter(target, adapt_on_names=True) to traverse the Label's
expression and rewrite column refs.

Test changes:
- New test_calculated_column_matches_manual_decomposition_per_flow
  invariant (replaces the cross-flow check that had false-positive drift
  amplification on comstock).
- Updated SQL substring assertions in test_schema_unique_keys to match
  the new ts_flat / ts_pivot subquery shape.

Verified: 68/68 invariants pass (including the 6 bare-fuel cross-flow
variants that were failing before).
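The two-level shape above (bucket time in an inner subquery, then pivot with FILTER aggregates) can be sketched on a toy table. Hedged heavily: names are illustrative, strftime stands in for Trino's date_trunc, and the real ts_flat layer also materializes per-row enduse arithmetic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ts (bldg_id INTEGER, upgrade INTEGER, t TEXT, kwh REAL);
    INSERT INTO ts VALUES
      (1, 0, '2018-01-01 00:15', 1.0), (1, 0, '2018-01-01 00:30', 1.0),
      (1, 1, '2018-01-01 00:15', 0.5), (1, 1, '2018-01-01 00:30', 0.5);
""")
# Inner subquery buckets 15-min timestamps to the hour first; the outer
# level pivots the two upgrade sides with FILTER aggregates, so the
# GROUP BY runs over far fewer keys than raw-timestamp grouping.
rows = con.execute("""
    SELECT bldg_id, hour,
           SUM(kwh) FILTER (WHERE upgrade = 0) AS bs_kwh,
           SUM(kwh) FILTER (WHERE upgrade = 1) AS up_kwh
    FROM (SELECT bldg_id, upgrade,
                 strftime('%Y-%m-%d %H:00', t) AS hour, kwh
          FROM ts)
    GROUP BY bldg_id, hour
""").fetchall()
print(rows)  # [(1, '2018-01-01 00:00', 2.0, 1.0)]
```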
Six TS-pivot-driven entries auto-updated by --update-snapshot after the
data check confirmed numerical equivalence:

Cost wins (query time, data scanned unchanged):
  savings_ts_monthly_upgrade1_electricity (resstock):  -33%
  savings_ts_monthly_two_fuel              (resstock): -48%
  savings_ts_monthly_upgrade1_electricity (comstock): -50%
  savings_ts_monthly_two_fuel              (comstock): -47%
  calculated_column_ts_pivot_with_baseline (comstock): -55%

  savings_ts_monthly_upgrade1_quartiles    (comstock): -57%
  ↑ NOT yet written (data-mismatch — see below).

Two entries still need user attention:

1. savings_ts_monthly_upgrade1_quartiles (resstock): cost regression +32%
   on query time alone (data scanned unchanged). Auto-blocked per
   CLAUDE.md cost-regression gate.

2. savings_ts_monthly_upgrade1_quartiles (comstock): data-status=mismatch.
   The new pivot pre-buckets time at the inner GROUP BY, so
   approx_percentile(coalesce(bs - up)) is now over hourly savings
   values, not raw 15-min values. This is a real semantic change of the
   savings-quartile distribution (smoother — averaging within hours
   reduces outliers). Both quartile-savings entries need a deliberate
   --overwrite-snapshot from the user.
The TS pivot's inner GROUP BY pre-buckets time at the user's
`timestamp_grouping_func` granularity. Quartile aggregates take the
percentile over the resulting per-(bldg, bucket) summed values — which
is the consistent interpretation for a monthly query: "the 50th
percentile of MONTHLY building totals", not "the 50th percentile of
15-min instantaneous values" (the latter doesn't compose with the rest
of a monthly rollup).

The pre-pivot code emitted the 15-min-level quartile because that's how
the SQL fell out of its row-per-15-min pivot subquery — not by design.
The new behavior is more meaningful and matches user intent.

Snapshots for `*_quartiles` TS entries need `--overwrite-snapshot` to
adopt the new (correct) values.
Quartiles over timeseries are confusing: percentile-of-15-min-values
doesn't compose meaningfully with a monthly rollup, and percentile-of-
per-bucket-sums is non-obvious. Sum/avg/min/max are sufficient
diagnostics on the TS path; full distributional analysis belongs on
annual queries.

Pydantic validator on Query rejects get_quartiles=True with
annual_only=False at construction time. Snapshot entry
savings_ts_monthly_upgrade1_quartiles removed from savings.json,
its cache files deleted, and the QUARTILE_ENTRIES invariant list
updated to drop it. Other quartile coverage (annual_baseline_quartiles
and savings_annual_upgrade1_quartiles) remains.
Three structural improvements to the TS aggregation pipeline, building on
the upgrade-pair pivot work:

1. **Single shape for baseline + upgrade-pair**: ts_flat → ts_aggr → outer
   JOIN. Single-upgrade uses plain SUM(_v__col); upgrade-pair uses FILTER
   per side. Inner pre-bucketing applies uniformly: 4×/720×/35,000× shuffle
   key cardinality reduction for hourly/monthly/yearly. Raw 15-min single-
   upgrade output skips the aggregate layer (one row in, one row out).

2. **Partition-aligned GROUP BY ordering**: inner ts_aggr GROUPs BY
   (state, timestamp, bldg_id) — partition col first. Outer SELECT
   defaults to inserting time AFTER the user's group_by columns, so
   GROUP BY (state, timestamp) at outer too. Matches the parquet's
   `by_state` partitioning instead of fighting it.

3. **Pure-bs enduses skip ts_flat entirely**: characteristic columns
   (e.g. `bs.c["in.sqft..ft2"]`) and pure-bs calc cols are projected at
   the outer SELECT directly via `sum(bs.<col> * bs.weight)`. Eliminates:
   - the duplicate bs JOIN inside ts_flat that Gemini-style reviewers
     flagged
   - the rows-per-bucket multiplier on characteristic enduses (sqft was
     getting `4 × sum(sqft × weight)` for hourly; now correctly returns
     `sum(sqft × weight)` once per bucket)
   Mixed bs+ts calc cols still take the inner-join path (preserves
   existing behavior for that rare case).

Snapshot SQL hashes drift across all TS entries — data unchanged but
hashes need user-driven `--update-snapshot`. All 68 invariants pass;
21 schema-unique-key tests updated to match new shape.
ComStock's md_by_state_and_county_parquet is tract-denormalized: multiple
rows per (bldg_id, state). The previous TS-aggregation pipeline did a
direct ts_aggr ⋈ bs JOIN at full md grain, which fanned out each pivot
row by N_tracts-per-bldg. The post-join COUNT(DISTINCT (bldg, tract,
state)) shuffle ran over the blown-up rowset — Stage 5 of a national
hourly query processed 6.28 B rows / 499 GB and aborted at 17h33m on the
user's failing query.

ResStock's md is one row per building, so this never came up there.

Fix (suggested by Claude-on-web): pre-aggregate bs to per-(bldg, state)
grain in a `bs_per_bldg` subquery before the outer JOIN. All bs-side
aggregations are linear in `weight`, so the math collapses cleanly:
  - bldg_weight = SUM(weight) per bldg
  - tract_count = COUNT(*) per bldg (used for sample_count via SUM at outer)
  - bs-side characteristic columns (sqft, building_type, etc.) — projected
    via arbitrary() since they're per-bldg constants.

Outer SELECT becomes uniform: every aggregate is sum(<value> * bldg_weight)
(for ts-side enduses) or sum(bs_per_bldg.<bldg_constant>) (for sample_count
and bldg_weight). No special pure-bs path; no double-multiplication; no
tract-grain JOIN.

Also folds the single-upgrade branch into the unified shape: previously
it took a direct ts ⋈ bs join (no ts_aggr layer), which couldn't benefit
from bs_per_bldg's tract-fan-out elimination. Now ts_aggr applies
uniformly. The per-bldg shuffle stage (3 B keys for national hourly) is
narrow scalars and finishes quickly; the original Stage 5 blowup was
specifically the post-join fan-out, not ts_aggr.

Test changes:
- 21/21 schema-unique-keys tests updated to match the new SQL substring
  shape (count_distinct → sum(tract_count); bs ⋈ → bs_per_bldg ⋈).
- 66/66 invariants pass (data correctness verified end-to-end against
  Athena: cross-flow consistency, savings decomposition, multi-state
  additivity, quartile monotonicity, sample/units-count matching).
- 9 snapshot SQL-hash mismatches expected from the new shape — user-
  driven --update-snapshot regenerates after.

The user verified the original failing query (national ComStock hourly
baseline with sqft + calc-col enduses, 17h33m timeout before) now
finishes successfully.
When a user-supplied restrict references a bs-side column by its bare
name (e.g. `"geometry_building_type_recs"` instead of the full
`"in.geometry_building_type_recs"`), `_restrict_targets_md` previously
returned False because it only checked the bare form against
`bs_table.columns`. This caused `_split_restrict` to drop the clause into
`extra_restrict` (clauses that target neither md nor ts), which was then
applied at the outer WHERE.

After the bs_per_bldg refactor, the outer FROM no longer references the
canonical bs alias directly. An outer-WHERE clause like `bs.<col> = 'x'`
created an implicit comma-join, fanning out the bldg×tract product again.
On a national hourly TS query this produced a query that ran past 12+
minutes before being cancelled.

Fix: in `_restrict_targets_md`, also check the column-name with each
configured prefix (`_char_prefix`, `_out_prefix`) appended. Now
`"geometry_building_type_recs"` correctly classifies as md-side and gets
routed through `bs_per_bldg`'s inner WHERE.

This was a latent issue that became visible (and harmful) only after the
bs_per_bldg refactor stopped exposing the bs alias at the outer level.
The bs_per_bldg refactor replaced the canonical bs alias in the outer
FROM. join_list joins reference bs columns by the canonical alias, so
they were unbound and SA generated implicit comma-joins.

Fix routes _add_join's bs-side reference through the bs_alias param
(defaults to bs_table; TS queries pass bs_per_bldg). Also propagates
join_list's bs-side baseline columns into bs_per_bldg via arbitrary()
so they're available for the outer JOIN ON clause.

baseline_column_name in join_list can be either a string or an SA
Column (utility code passes Column objects directly). Both paths
handled.
Utility queries (aggregate_ts_by_eiaid, etc.) pass a join_list referencing
non-bs tables (e.g. eiaid_weights). bs_per_bldg can't cleanly absorb
weights/group-bys bound to those tables without comma-joining them in.

Quick fix: when params.join_list is non-empty, skip the bs_per_bldg
pre-aggregation and use the pre-refactor direct ts ⋈ bs JOIN shape.
Utility queries are typically eiaid-restricted, so tract fan-out is
small enough not to blow up.

A follow-up will fold the join_list table into bs_per_bldg (per the
user's suggestion: "any extra join we do - such as for eiaid_weight is
only for the metadata column and we should be able to fold it into
metadata per state/bldg subquery"). That keeps the tract-fan-out
elimination win for utility queries too. For now this fallback at
least restores correctness.

Outer SELECT's grouping_metrics / total_weight / agg_weight all gate
on `"bldg_weight" in md_alias.c` to detect which shape ran.

Per user request: utility queries' join_list joins (e.g. eiaid_weights
mapping bldg→eiaid) are metadata-side extensions of bs. Folding them
INTO bs_per_bldg's FROM keeps the outer query a clean
ts_aggr ⋈ bs_per_bldg shape with no extra outer JOINs and no comma-join
risks.

bs_per_bldg's GROUP BY stays at (bldg, state) regardless of whether the
user grouped by a join_list-table column. join_list group-by columns
(e.g. eiaid) get projected via arbitrary() — picks one value per bldg.
That's only correct when the column is per-bldg constant after the
restrict (typical utility flow batches by single eiaid). User signed
off on this semantic: "the building doesn't need to appear twice. Only
when user explicitly passed eiaid in the group by do we need to
preserve the eiaid group" — preserved at outer GROUP BY level.

Restrict clauses targeting join_list tables (e.g.
`eiaid_weights.eiaid IN ['4110']`) are routed from extra_restrict to
the bs_per_bldg WHERE so the join filter pushes through.

The outer _add_join is a no-op for these queries (the join already
happened inside bs_per_bldg). For annual or non-bs_per_bldg paths,
_add_join still adds the join at outer.

The cost gate fired on any percentage delta over ±20%, even when the
absolute change was sub-MB or sub-second. After the bs_per_bldg refactor
this caused 19 entries to be blocked from update-snapshot writes despite
having matching data — most were noise-floor regressions like
4241→4242 MB or 5→6 s.

Add per-metric absolute floors (10 MB, 10 s) that combine with the
existing ±20% percentage check: a metric trips the gate only if BOTH
thresholds are exceeded. The two existing all-noise floors
(_COST_FLOOR_MB / _COST_FLOOR_MS) stay in place as the early-return
gate.
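
The combined check reduces to a few lines. A minimal sketch, with floor
values from this commit message; the function and constant names are
illustrative, not the real gate's API.

```python
ABS_FLOOR_MB = 10   # absolute floor for bytes-scanned deltas
ABS_FLOOR_S = 10    # absolute floor for wall-clock deltas
PCT_FLOOR = 0.20    # the existing ±20% relative check

def trips_gate(old: float, new: float, abs_floor: float) -> bool:
    """A metric trips the gate only if BOTH the relative and the
    absolute change exceed their thresholds."""
    delta = abs(new - old)
    pct = delta / old if old else float("inf")
    return pct > PCT_FLOOR and delta > abs_floor

# 4241 -> 4242 MB: tiny relative and absolute change -- noise, passes.
print(trips_gate(4241, 4242, ABS_FLOOR_MB))   # False
# 5 -> 6 s: +1 s is under the 10 s floor -- noise, passes.
print(trips_gate(5, 6, ABS_FLOOR_S))          # False
# A hypothetical +3 GB scan regression: trips both thresholds.
print(trips_gate(4242, 7300, ABS_FLOOR_MB))   # True
```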

The refresh writes through 13 entries that were previously blocked.
The 6 remaining blocks are real shape regressions (5 ts_year/inv_year
entries with +3 GB scan from the new ts_aggr layer forcing full TS
partition scans on year-collapse; 1 calculated_column entry with +14 s
wall-clock) for separate investigation.

When timestamp_grouping_func='year' the outer query collapses time via
SUM regardless — carrying timestamp through the inner ts_flat / ts_aggr
forced Athena to materialize the truncated value before the inner GROUP
BY, blocking partial-aggregation pushdown into the scan.

For year-collapse, drop timestamp from ts_key_names entirely. The inner
SELECT no longer projects it, the inner GROUP BY no longer keys on it,
and Athena fuses the per-bldg sum directly with the scan.
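
The key-list change can be sketched as below. This is an illustration only:
the timestamp key name ("timestamp") and the helper shape are assumptions,
not the real ts_key_names plumbing.

```python
def inner_ts_keys(ts_key_names: list[str], grouping_func: str) -> list[str]:
    """Keys for the inner SELECT / GROUP BY of the ts pre-aggregation."""
    if grouping_func == "year":
        # The outer SUM collapses time anyway; keeping the timestamp here
        # forced Athena to materialize the truncated value before the inner
        # GROUP BY, blocking partial-aggregation pushdown into the scan.
        return [k for k in ts_key_names if k != "timestamp"]
    return ts_key_names

print(inner_ts_keys(["bldg_id", "timestamp"], "year"))   # ['bldg_id']
print(inner_ts_keys(["bldg_id", "timestamp"], "month"))  # ['bldg_id', 'timestamp']
```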

Recovers pre-pivot performance for the year-collapse shape:
- inv_baseline_ts_year_by_building_type: 4242 → 1128 MB (-73%),
  15694 → 5376 ms (-66%, 10s faster).
- 5 ts_year / inv_year cost regressions clear, plus 5 dependent
  snapshot entries write through cleanly.

The bs_per_bldg layer stays in place to handle ComStock's tract
fan-out — the regression was specific to the inner timestamp
materialization, not the per-bldg pre-aggregation.

Investigation: ran calculated_column_ts_pivot_with_baseline against
Athena 3 times each on the OLD and NEW SQL with cache-busting nonces.

  OLD (3 fresh runs): 9881 / 6310 / 14979 ms
  NEW (3 fresh runs): 10408 / 7422 / 19444 ms

~10s mean per shape, with a natural ~6-19s spread per single execution.
The +199% / +14.3s "regression" was an unlucky pairing of two outliers
(stored 7187 ms baseline vs fresh 21454 ms) — the SQL shapes are
statistically equivalent. Bytes-scanned identical (~8.2 GB both shapes,
deterministic).

No code change; just refresh the cost sidecar via --overwrite-snapshot
so the gate stops reporting noise as a regression.

The 10s absolute floor on the timing gate is smaller than the natural
single-execution spread for queries in this size class. Future work
could either drop the timing check entirely (rely on bytes scanned,
which is what Athena bills) or scale the floor with query size; for
now, accept that single-run timing is informational only.

Each (flavor.json, schema) pair becomes
tests/example_notebooks/<flavor>_<schema>.ipynb containing:
  1. Imports
  2. BuildStockQuery(...) construction (mirrors conftest fixture)
  3. One markdown+code pair per entry that calls the entry's method with
     the entry's args and embeds a head() preview of the result

Notebooks regenerate inside `evaluate_entries` when:
  - the file is missing (bootstrap), OR
  - any entry just had a writethrough, OR
  - --update-snapshot / --overwrite-snapshot was passed (data fresh)

For deterministic git diffs:
  - cell IDs derived from md5(source) — same content → same ID
  - execution_count = None (no run-counter churn)
  - result preview is head().to_string() — bounded, plaintext
  - JSON serialized with stable indent + key order

Result: a runnable companion notebook per flavor whose diff flips
only when SQL or data actually changes.
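
The determinism tricks boil down to a stable cell constructor. A stdlib-only
sketch of the on-disk JSON shape (the real builder presumably goes through
nbformat; the field layout below follows the nbformat 4.5 cell schema):

```python
import hashlib
import json

def make_code_cell(source: str) -> dict:
    return {
        "cell_type": "code",
        # Same content -> same ID, so regeneration never churns the diff.
        "id": hashlib.md5(source.encode()).hexdigest()[:8],
        "execution_count": None,   # no run-counter churn
        "metadata": {},
        "outputs": [],
        "source": source.splitlines(keepends=True),
    }

cell = make_code_cell("df = bsq.agg.aggregate_annual(enduses=['fuel_use'])\n"
                      "df.head()")
# Stable indent + sorted keys => byte-identical serialization per content.
print(json.dumps(cell, indent=1, sort_keys=True)[:60])
```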

The generated BuildStockQuery in each notebook now passes
`cache_folder=tests/query_snapshots/<schema>_cache`, so query calls in
the notebook reuse the parquets the test suite already downloaded.
First-run cost drops to ~$0 — Athena returns cached results without
scanning data.

Resolves the cache path with a Path.cwd()-based fallback chain so the
notebook works whether Jupyter was launched from the repo root, the
tests/ directory, or tests/example_notebooks/ itself.

Previously the notebook used Path.cwd() with a 3-way fallback chain to
find the snapshot test cache directory. That worked from the repo root,
tests/, or tests/example_notebooks/, but could fail if Jupyter was
launched from a deeper or unrelated directory.

Use IPython's `_dh[0]` (the notebook's own directory at kernel startup)
to resolve the cache as a relative sibling: `../query_snapshots/<schema>_cache`.
Falls back to "." if `_dh` is missing (e.g. when the cell is run outside
Jupyter), preserving backward compatibility with non-Jupyter execution.

Verified end-to-end via `jupyter nbconvert --execute` on
annual_resstock_oedi.ipynb — 13 cells executed, all queries returned
cached results with $0.00 cost.
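
The resolution in the generated cell amounts to a couple of lines. A sketch:
`_dh` is IPython's directory-history global (index 0 is the notebook's own
directory at kernel startup); the schema name in the path is illustrative.

```python
from pathlib import Path

# "." fallback keeps the cell runnable outside Jupyter, where _dh is absent.
nb_dir = Path(globals().get("_dh", ["."])[0])
cache_folder = nb_dir / ".." / "query_snapshots" / "resstock_oedi_cache"
print(cache_folder)
```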

Generated notebooks now run through nbclient after writing so cells
contain real DataFrame outputs (rich text/html tables + text/plain)
instead of a pre-embedded preview. The notebook IS the runnable artifact
AND the source of truth for current behavior.

Two safety mechanisms keep regen cheap and bounded:

  1. `_NO_EXECUTE_METHODS` allowlist — `report.get_success_report`,
     `report.check_ts_bs_integrity`, and `agg.get_building_average_kws_at`
     are commented out in source. Each fans out internally to many
     Athena queries the snapshot cache doesn't cover (count-distinct
     per upgrade, per-pair change tracking, the 3.2 TB landmine
     documented in CLAUDE.md). Determined by reading method source —
     only true singleton-execute methods run live.

  2. `_has_unroundtrippable_placeholder` — auto-comments any cell whose
     args contain SA object reprs (`<sqlalchemy....Label at 0x...>`,
     `MappedColumn(bsq=<...object at 0x...>)`, or our own `<Class ...>`
     fallback). Those can't run as written; the user must construct
     the SA object inline.
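
The placeholder scan can be approximated with a single regex. An illustrative
sketch only; the real helper's patterns may differ, and the example strings
are invented.

```python
import re

_PLACEHOLDER_RE = re.compile(
    r"<sqlalchemy\.\S* at 0x[0-9a-f]+>"   # SA object reprs, e.g. Label
    r"|MappedColumn\(.*object at 0x"      # wrapper holding a live object
    r"|<[A-Za-z_][\w.]* [^>]*>"           # generic <Class ...> fallback
)

def has_unroundtrippable_placeholder(src: str) -> bool:
    """True when generated source embeds a repr that can't be re-executed."""
    return bool(_PLACEHOLDER_RE.search(src))

print(has_unroundtrippable_placeholder(
    "agg(enduses=[<sqlalchemy.sql.elements.Label at 0x7f3c2b>])"))  # True
print(has_unroundtrippable_placeholder(
    "bsq.agg.aggregate_annual(enduses=['fuel_use'])"))              # False
```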

`allow_errors=True` on the nbclient executor: a single bad cell embeds
its error in the notebook instead of aborting execution. This catches
pre-existing interactive-mode bugs in `bsq.get_results_csv` etc. that
the snapshot tests miss (they run with get_query_only=True, never
exercising the post-query DataFrame access).

End state: 121 cells executed cleanly, 8 commented (unsafe), 10
commented (placeholder), 6 cells with embedded errors documenting
real bugs to fix separately.

Removes unused imports (`os` in main.py and helpers.py, `sqltypes` in
main.py) and a stale `ucol` local in aggregate_query.py; also adds a
missing blank line before a nested def and wraps a 124-char comment
line. The CI Lint step was failing on these six pyflakes/pycodestyle
violations.
@rajeee rajeee changed the title uv-native packaging migration (v0.4.0) Switch to OEDI Format Apr 28, 2026
rajeee added 7 commits April 28, 2026 12:42
The bs_per_bldg subquery (introduced to collapse ComStock's tract fan-out
before joining ts_aggr) was wrapping bs-side group_by columns in
arbitrary() and grouping only on (bldg_id, state). For ComStock — whose
metadata is partitioned at (bldg_id, tract, state) granularity with
weight divided across tract rows — a building straddling counties had
its full weight attributed to whichever county arbitrary() picked,
silently dropping the tract-fractional disaggregation.

Same issue silently affected ResStock's utility_ts_by_eiaid: eiaid_weights
fans out per-county into multiple utilities, so arbitrary(eiaid) collapsed
the weighted apportionment.

Fix: project bs-side group_by columns as bare references and add their
underlying expressions to bs_per_bldg's GROUP BY. Pure-bs enduses
(sqft, vintage) and extra_bs_cols stay under arbitrary() — they're
true per-bldg constants.
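
A toy illustration of the weight bug, with invented numbers: one building
straddling two counties, its weight split across tract rows.

```python
rows = [  # (bldg_id, county, weight) -- tract-fractional metadata rows
    (101, "Denver", 0.4),
    (101, "Jefferson", 0.6),
]

# arbitrary()-style collapse: one county picked per building, the full
# weight attributed to it -- the fractional disaggregation is lost.
picked_county = rows[0][1]
arbitrary_by_county = {picked_county: sum(w for _, _, w in rows)}
print(arbitrary_by_county)            # {'Denver': 1.0}

# Fix: key the per-building pre-aggregation on the group-by column too,
# so each county keeps its fractional share.
by_county: dict[str, float] = {}
for _, county, w in rows:
    by_county[county] = by_county.get(county, 0.0) + w
print(by_county)                      # {'Denver': 0.4, 'Jefferson': 0.6}
```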

Adds three ComStock-only timeseries snapshots (year-collapse: by_county,
by_tract, overall_co) plus an invariant cross-checking sum(by-county)
== sum(by-tract) == overall. Replaces utility_ts_by_eiaid's eiaid=4110
(no CO data → empty) with eiaid=15466 (Xcel CO, ~9.3k samples) so the
snapshot exercises the disaggregation path with real data.

SqlCache now appends each get/put hash to .cache_usage_log inside the
cache folder so we can tell which entries a test session actually
exercised. tests/cleanup_stale_caches.py removes cache files whose
hashes are referenced in neither the snapshot JSONs nor the usage log;
tests/refresh_caches.py wraps the clear → run-tests → cleanup flow.

Also includes the stale entries that the first cleanup run identified
and dropped, freeing the resstock_oedi_cache + comstock_oedi_cache
directories of orphans left over from previous SQL refactors.
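
The usage-log append is a one-liner per access. A stdlib-only sketch; the
commit message doesn't show SqlCache's real method names, so the function
here is illustrative.

```python
import tempfile
from pathlib import Path

def log_cache_use(cache_folder: Path, query_hash: str) -> None:
    """Append each get/put hash so a later cleanup pass can tell which
    cache entries a test session actually exercised."""
    log = cache_folder / ".cache_usage_log"
    with log.open("a") as f:
        f.write(query_hash + "\n")

with tempfile.TemporaryDirectory() as d:
    log_cache_use(Path(d), "ab12cd34")
    log_cache_use(Path(d), "ef56ab78")
    print((Path(d) / ".cache_usage_log").read_text().splitlines())
    # ['ab12cd34', 'ef56ab78']
```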

Timestamps and a few result-data fields refreshed from the most recent
test session that exercised every notebook against the pruned cache.

Extends the test suite to cover the county-aggregated ComStock metadata
table (`_md_agg_by_state_and_county_parquet`) where buildings within a
county are pre-collapsed into one weight-correct row. The agg schema is
wired into every schema-aware list/dispatcher (fixtures, snapshot and
invariant parametrize lists, placeholder resolver, cleanup script,
notebook builder, backfill costs) and the snapshot harness's
`_patch_hash` now inserts a missing schema key in addition to overwriting
an existing one. `--update-snapshot` populated the new
`comstock_oedi_agg_cache/` with 76 sidecar triples covering 66 JSON
entries.

Walks every snapshot JSON entry across six weight-aggregating flavors
(annual, timeseries, savings, applied_only, restrict_avoid,
invariants_three_way), runs the query on both ComStock schemas, and
asserts equal DataFrames after dropping `sample_count` (the only column
that differs because the agg table has fewer rows). Filters out
nondeterministic entries and aggregations that don't survive county
pre-collapse (mean/max/min/quartiles/arbitrary).

A separate test handles `get_building_ids`: the two schemas return
different partition columns (tract vs county), so the comparison
projects to (bldg_id, state) and dedupes — the resulting building set
is identical across schemas. Both tests run from the snapshot caches in
under 6 seconds with no Athena spend.
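
Why only sum-style aggregations survive county pre-collapse can be shown with
toy numbers (invented for this sketch): the agg schema stores one
weight-correct row per county, so a county sum is preserved, but building-level
max/min/mean are not recoverable.

```python
raw = [("Denver", 2.0), ("Denver", 4.0), ("Jefferson", 3.0)]  # per-bldg rows
agg = [("Denver", 6.0, 2), ("Jefferson", 3.0, 1)]             # (county, sum, sample_count)

raw_sum = sum(v for _, v in raw)
agg_sum = sum(s for _, s, _ in agg)
print(raw_sum == agg_sum)        # True  -> comparable across schemas

raw_max = max(v for _, v in raw)  # per-building max: 4.0
agg_max = max(s for _, s, _ in agg)  # county-row max: 6.0, not a building
print(raw_max == agg_max)        # False -> filtered out of the comparison
```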

Adds the three-way invariant on the applied_in=[1,2] code path that
combines flows the existing tests covered separately but never together:
annual ≡ ts-year-collapse ≡ sum(ts-monthly), now exercised on both the
categorical comstock_building_type axis and the county partition-key
axis whose arbitrary()-collapse silently broke before commit 182ff21.

Three new snapshot entries (annual / ts_year / ts_monthly grouped by
county, applied_only=true, applied_in=[1,2], CO) populate the cache for
both ComStock schemas, gated to comstock_oedi/comstock_oedi_agg via the
JSON `schemas` allowlist since resstock has no county column.
@rajeee commented May 4, 2026

Superseded by: #75

@rajeee rajeee closed this May 4, 2026