Skip to content

Release v0.9.0#47

Merged
brianlball merged 53 commits into
mainfrom
develop
Apr 10, 2026
Merged

Release v0.9.0#47
brianlball merged 53 commits into
mainfrom
develop

Conversation

@brianlball
Copy link
Copy Markdown
Collaborator

Release v0.9.0

195 commits since last release to main. Major additions:

Highlights

  • 142 MCP tools across 22 skills (up from 126 in v0.3.0)
  • Geometry tools: create buildings from DOE prototypes, import FloorSpaceJS
  • Measure authoring: create, edit, test OpenStudio measures via MCP
  • Tool routing: SDK search, wiring pattern recipes, smart tool recommendations
  • LLM test suite: 170+ tests with progressive difficulty, cross-model benchmarks
  • Stdout suppression fix (issue MCP error -32001: Request timed out on all tools — stdout suppression race condition #42): permanent fd redirect eliminates concurrent tool timeouts and C-level stdout pollution

New skills & tools

  • api_reference: search_api, search_wiring_patterns
  • tool_router: recommend_tools
  • measure_authoring: create_measure, edit_measure, test_measure
  • geometry: create_bar_building, create_new_building, import_floorspacejs
  • object_management: get_object_fields, set_object_property, dynamic list_model_objects
  • HVAC: FourPipeBeam, CooledBeam terminals, set_zone_equipment_priority

Bug fixes

Testing

  • 450+ integration tests across 5 CI shards
  • 170+ LLM agent tests (local-only, not CI)
  • Concurrent tool regression tests
  • Stdout purity tests

See CHANGELOG.md for full details.

Test plan

  • CI shards 1-5 pass
  • Version reads 0.9.0

🤖 Generated with Claude Code

brianlball and others added 30 commits March 15, 2026 08:47
list_files: restrict to /inputs and /runs (reject /opt/*, /repo, etc),
default max_depth=2, files only (no dirs), early-exit on max_results.
list_weather_files: discover EPWs from openstudio-standards gem +
ChangeBuildingLocation/tests + /inputs, report .ddy/.stat companions.
Update change_building_location docstring + server instructions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard 2 was bottleneck at 9:10, shard 5 idle at 1:04.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 new quality tests (plugloads + boiler × Ruby + Python) verify LLM
creates reusable measures with typed args, defaults, Choice values.
6 new full-chain workflow cases (baseline→measure→sim→compare).
Add Agent to BUILTIN_TOOLS filter so subagent calls don't leak into
tool_names assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
create_measure: escape quotes in description/modeler_description,
return ok:false on syntax errors, add Intended Software Tool XML attrs.
edit_measure: robust description regex handles existing quotes.

server instructions: explicit NEVER/ALWAYS for measures, results,
visualization, models, weather, HVAC — prevents LLM from writing
scripts when MCP tools exist.

README: document /inputs mount as preferred file location over uploads
(Analysis mode sandbox bypasses MCP tools entirely).

LLM tests: 4 regression tests reproducing the original debug chat
scenario (quoted descriptions, edit with quotes, XML attrs, syntax
error reporting).

Plans: agent-guardrails.md (completed + remaining), tool-routing.md
(industry research, FastMCP annotations, RAG-MCP, implementation
options for tool grouping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ard, docstring hints

- compare_runs: per-fuel deltas instead of collapsed totals, Water separated from energy
- create_new_building: clear error when no climate_zone/weather_file instead of silent fail
- create_measure/edit_measure: add get_skill('measure-authoring') tip
- read_file: hint to prefer structured tools over raw IDF reads
- 6 unit tests for compare_runs output shape + Water exclusion
- 1 integration test for create_new_building climate_zone guard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… docstring hardening

Phase 1-3 of tool routing plan:
- search_api skill: introspect openstudio.model classes/methods, catches hallucinated methods
- recommend_tools skill: keyword-based routing to 9 tool groups (core, geometry, hvac, simulation, results, measures, loads, envelope, meta)
- tags on all 141 tools across 22 tools.py files
- docstring hardening: read_file, list_files, view_model, view_simulation_data, generate_results_report, create_measure
- fix Windows ? in ndjson log filenames (conftest)

Tests: 35 unit + 12 Docker integration + 9 LLM A/B
Full LLM regression: 166/172 (96.5%) — no regressions vs Run 7 (97.5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sources

Curated Ruby snippets showing how to wire coils→loops, terminals→air loops,
zone equipment→zones. Covers: four-pipe beam, cooled beam, DOAS, VRF, PTAC,
PTHP, fan coils, baseboards, WSHP, plant loop HPs, unitary systems,
absorption chillers, setpoint managers, plant/air loop construction.

No Docker build change — recipes are Python dicts shipped with code.
Completes all 6 items in plan-debug-session-fixes.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- server.py instructions: mention both tools for custom HVAC measures
- measure-authoring SKILL.md: "Before Writing HVAC Measures" section
- create_measure docstring: TIP to call search_api + search_wiring_patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…checks

26 tests: required fields, no geometry in snippets, search accuracy for
all 17 HVAC patterns, Ruby snippet validation (addBranchForZone for
terminals, SetpointManager for plant loops, addToThermalZone for zone HVAC).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 new tests: HVAC measure authoring checks create_measure + reference tools,
search_api method verification, search_wiring_patterns for wiring recipes.
Accept alternative tools (get_object_fields, get_skill) as valid — new tools
not yet in LLM training data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…erns checks

Failures document the tool overload problem (FM1). Tests should fail until
the discovery issue is actually solved, not mask it with alternatives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ToolSearch (ENABLE_TOOL_SEARCH=true) cannot find search_api or
search_wiring_patterns with any query. Other MCP tools (create_measure,
get_object_fields) are discoverable. Need to optimize tool names/descriptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ools findable

Root cause: ToolSearch indexes at Docker build time, not runtime.
Volume-mounted new tools were invisible. After rebuild:
- search_api: found 1st for "search_api", "SDK methods"
- search_wiring_patterns: found 1st for "wiring patterns", "four pipe beam"
- recommend_tools: found 1st for "recommend tools"

Enriched tool descriptions with use cases, examples, keywords.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…09 pass

7 failures all known flaky. replace_windows_L1 new flaky — agent called
search_api (discovered the new tool!) but didn't call replace_window_constructions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 options analyzed: consolidate to ~80, split into 4 servers, FastMCP
mount, hybrid, or enrich descriptions only. Recommends phased approach:
enrich descriptions first, consolidate typed tools second, split only
if needed for Cursor/client compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Profile flag on single entry point/image registers subset of skills (~35 each).
Shared model state via /runs volume + auto-save. Includes client limits research,
Docker considerations, testing strategy, and all citations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ools

History shows we compressed descriptions 30% (a78d308) to reduce context,
then ToolSearch made that counterproductive. Enriched descriptions proven
to work (search_api went from invisible to 1st result). Plan: restore
keyword-rich descriptions for 85 tools without restoring bloat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 lessons learned from MCP tool discovery at scale (62→142 tools):
description compression was counterproductive (ToolSearch existed but we
didn't know), tags are inert, typed tools beat generic for discovery,
server instructions are the biggest lever (44%→83%), progressive tests
reveal structural limits. Full timeline, metrics, PR history, citations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
22 tools.py files to enrich, README client compatibility section, tool
count updates, new min description length test. Recover keywords from
pre-compression commit (a78d308). No tool removal or architecture changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
22 tools.py files enriched with domain keywords, use cases, field lists.
Reverses description compression (a78d308) that hurt ToolSearch matching.
All tools now have ≥40 char first-line descriptions.

README: tool count 134→142, add client compatibility table (Cursor not
compatible at 40-tool cap). CLAUDE.md + server.py: count 138→142.
New test: test_min_description_length enforces ≥40 chars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 142 tool descriptions enriched, min 40-char first line enforced,
README/CLAUDE.md/server.py updated, client compatibility table added.
11/12 LLM tests pass. search_api + search_wiring_patterns discoverable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same 7 known flaky failures. No regression from enriching all 142 tool
descriptions. Confirms description changes are safe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
brianlball and others added 23 commits March 20, 2026 14:46
…ferences

tool-workflows: add HVAC measure verification step
add-hvac: add custom HVAC wiring section
troubleshoot: add SDK method verification section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tool sections now sum to 142. Added: validate_model, extract_simulation_errors,
list_output_variables, compare_runs, list_weather_files, search_api,
search_wiring_patterns, recommend_tools. Added /troubleshoot to skills table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…emphasis

Audit: 82% of tools have no when-to-use guidance, 93% no negative scope,
only 3 use emphasis keywords. Plan covers 142 tools across 4 tiers with
specific confusion pairs, L1 failure analysis, and emphasis targets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-ToolSearch guidance (2024) said "extremely detailed." Post-ToolSearch
guidance (2025) says "semantic keywords." These conflict — verbose
descriptions may hurt ToolSearch discovery. Target only confusion pairs
(16), L1 failures (7), bypass-prone (8), shortest (12).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ovement

Before/after benchmark: 11/15 (73.3%) both runs. 8/8 confusion pairs pass.
4 L1 failures unchanged — structural prompt ambiguity, not description quality.
Added when-to-use, negative scope, emphasis to ~35 tools. Agent's alternative
tool choices are reasonable for vague prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tool routing, API reference, wiring recipes, description enrichment
- run_qaqc_L1: accept validate_model (correct for pre-sim check)
- list_dynamic_type_L1: accept get_sizing_*_properties (more specific)
- replace_windows_L1: accept list_common_measures, materials exploration
- check_loads_L1: accept get_space_details (contains loads)

Remove 4 from FLAKY_TESTS — expanded expected sets make them stable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
openstudio-patterns: replace create_example/baseline_osm with create_new_building
  as primary, add validate_model + search_api/search_wiring_patterns sections
new-building: manual workflow uses create_bar_building not create_example_osm
qaqc: add validate_model as first step
energy-report: add generate_results_report, compare_runs, extract_simulation_errors
retrofit: add compare_runs tool for step 5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ured

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… passes

Codex-reviewed all 67 test files against testing rules. Fixed:
- 8 critical: unfalsifiable tests (isinstance(ok,bool), conditional silent pass)
- 17 high: tautological/existence-only assertions, missing value checks
- 17 medium: weak error paths, missing payload validation
- 4 additional findings from plan review (test_add_ev_load, test_add_zone_ventilation,
  test_list_files_items, test_get_air_loop_details)

Key changes:
- test_hvac.py: use baseline+System7 model so HVAC is guaranteed, remove if-guards
- test_common_measures.py: replace isinstance(ok,bool) with ok=True+validate or skip/fail
- test_path_safety.py: monkeypatch Popen for deterministic staging, unconditional asserts
- test_component_controls.py: fix SPM lookup (was searching wrong dict level + wrong type)
- test_building.py: NaN/Inf guard now uses math.isfinite not isinstance
- test_skill_retrofit.py: actually compares baseline vs retrofit energy metrics
- All error-path tests: assert error message content, not just key existence

Also: test_replace_window_constructions now filters for window constructions,
test_set_setpoint_min_max_temp adapts properties to actual SPM type found.

2 thermostat tests still skip — genuine tool bug (Choice-type args passed as String
in OSW), tracked in #40.

270 passed, 3 skipped, 0 failed in Docker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: wrappers passed schedule names to measures without checking
they match the measure's Choice list filter (e.g. Temperature unitType).
OSW runner error "type String while Choice was expected" actually means
value not in Choice list — misleading.

- _resolve_handle → _resolve_choice_name (returns nameString)
- add _validate_schedule: checks exists, is Schedule, has type limits,
  optionally validates unitType
- thermostat wrappers: reject non-Temperature schedules with clear error
- add_zone_ventilation: require schedule_name (measure vent_sch mandatory)
- tests: use correct Temperature-type schedules, remove lenient skips
- 2 new tests: bad schedule type + missing schedule validation

25/25 test_common_measures pass (was 22 pass + 3 skip)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Audit found tests passing despite tool issues — same pattern as #40
where wrong schedule type was masked by lenient skips.

- test_common_measures: tautological >= → >, ok-only → runner_messages
  check, if ok: pass → assert result=="Success", remove unfalsifiable
  LifeCycleCost readbacks (unsupported type)
- test_response_sizes: fixture if-guards → asserts (except air_loops
  which needs HVAC), skip-on-low-count → assert, tautological <= → value
- test_hvac: >= 0 tautological → >= 1, remove redundant isinstance
- test_hvac_validation: System 8 HW loop exists (not absent), PFP
  terminals == len(zones) not > 0, unit heaters >= len(zones)
- test_component_controls: if prop in changes → assert prop in changes

149 pass, 2 skip (add_pv_to_shading: no shading surfaces in baseline,
filter_zones_by_air_loop: fixture has no HVAC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post #40 fix + test audit. 7 previously-flaky L1s now passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove 5 conditional silent-pass patterns in test_path_safety
- Add pytest.approx() to 8 bare float comparisons across 6 files
- Replace 20 `is not None` checks with non-empty string assertions
- Add independent readback assertions for 5 tautological echo tests
- Add @pytest.mark.integration to test_validate_model
- Move SDK-dependent test from unit file to test_object_management

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Title: Harden test suite + fix Choice-type measure arg validation
…tries default 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OSMCP_CODE_MODE env var gates fastmcp CodeMode transform; off by
default. Runner/conftest detect code_mode_active, track tools called
inside execute blocks, and report CodeMode ON/OFF in benchmark md.
Bump fastmcp>=3.1.0 for experimental transform. Includes A/B sweep
data (off=95.3%, on=24.0%) and root-cause writeup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move llm-test-benchmark, testing, frameworks-summary, benchmark-
description-guidance under docs/testing/. Add README.md as technical
report with 7 embedded plots (run history, tier pass rates,
progressive L1/L2/L3, token profile, failure modes, cross-model
sweep, codemode A/B) + paragraph explanations and legends. Include
generate_plots.py + march/april sweep data it sources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New research notes under docs/knowledge/: architecture/testing
patterns, MCP best-practices gap analysis, reddit discovery thread,
APS agent paper, tool-discovery/LLM-testing writeup. Move geometry
research into knowledge/. Drop stale development-process-findings
and tool-discovery-research.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Redirect C-level stdout (fd 1) to stderr once at startup, give
Python sys.stdout a private fd to the real MCP client pipe.
Catches ALL C-level pollution (SWIG GC, Polyhedron geometry,
future unknowns) with zero races and no per-callsite wrappers.

Fixes concurrent tool timeout (issue #42) and Polyhedron stdout
leak on complex models (test_complex_model_stdout_purity).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@brianlball brianlball merged commit 7ec508c into main Apr 10, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant