Release v0.9.0 by brianlball · Pull Request #47 · NatLabRockies/openstudio-mcp

brianlball · 2026-04-10T15:12:09Z

Release v0.9.0

195 commits since last release to main. Major additions:

Highlights

142 MCP tools across 22 skills (up from 126 in v0.3.0)
Geometry tools: create buildings from DOE prototypes, import FloorSpaceJS
Measure authoring: create, edit, test OpenStudio measures via MCP
Tool routing: SDK search, wiring pattern recipes, smart tool recommendations
LLM test suite: 170+ tests with progressive difficulty, cross-model benchmarks
Stdout suppression fix (issue MCP error -32001: Request timed out on all tools — stdout suppression race condition #42): permanent fd redirect eliminates concurrent tool timeouts and C-level stdout pollution

New skills & tools

api_reference: search_api, search_wiring_patterns
tool_router: recommend_tools
measure_authoring: create_measure, edit_measure, test_measure
geometry: create_bar_building, create_new_building, import_floorspacejs
object_management: get_object_fields, set_object_property, dynamic list_model_objects
HVAC: FourPipeBeam, CooledBeam terminals, set_zone_equipment_priority

Bug fixes

Concurrent tool response loss (issue MCP error -32001: Request timed out on all tools — stdout suppression race condition #42)
Polyhedron stdout leak on complex models
Measure XML checksum staleness
JSON-string list param handling
EUI unit reporting

Testing

450+ integration tests across 5 CI shards
170+ LLM agent tests (local-only, not CI)
Concurrent tool regression tests
Stdout purity tests

See CHANGELOG.md for full details.

Test plan

CI shards 1-5 pass
Version reads 0.9.0

🤖 Generated with Claude Code

list_files: restrict to /inputs and /runs (reject /opt/*, /repo, etc), default max_depth=2, files only (no dirs), early-exit on max_results. list_weather_files: discover EPWs from openstudio-standards gem + ChangeBuildingLocation/tests + /inputs, report .ddy/.stat companions. Update change_building_location docstring + server instructions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Shard 2 was bottleneck at 9:10, shard 5 idle at 1:04. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

4 new quality tests (plugloads + boiler × Ruby + Python) verify LLM creates reusable measures with typed args, defaults, Choice values. 6 new full-chain workflow cases (baseline→measure→sim→compare). Add Agent to BUILTIN_TOOLS filter so subagent calls don't leak into tool_names assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

create_measure: escape quotes in description/modeler_description, return ok:false on syntax errors, add Intended Software Tool XML attrs. edit_measure: robust description regex handles existing quotes. server instructions: explicit NEVER/ALWAYS for measures, results, visualization, models, weather, HVAC — prevents LLM from writing scripts when MCP tools exist. README: document /inputs mount as preferred file location over uploads (Analysis mode sandbox bypasses MCP tools entirely). LLM tests: 4 regression tests reproducing the original debug chat scenario (quoted descriptions, edit with quotes, XML attrs, syntax error reporting). Plans: agent-guardrails.md (completed + remaining), tool-routing.md (industry research, FastMCP annotations, RAG-MCP, implementation options for tool grouping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ard, docstring hints - compare_runs: per-fuel deltas instead of collapsed totals, Water separated from energy - create_new_building: clear error when no climate_zone/weather_file instead of silent fail - create_measure/edit_measure: add get_skill('measure-authoring') tip - read_file: hint to prefer structured tools over raw IDF reads - 6 unit tests for compare_runs output shape + Water exclusion - 1 integration test for create_new_building climate_zone guard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… docstring hardening Phase 1-3 of tool routing plan: - search_api skill: introspect openstudio.model classes/methods, catches hallucinated methods - recommend_tools skill: keyword-based routing to 9 tool groups (core, geometry, hvac, simulation, results, measures, loads, envelope, meta) - tags on all 141 tools across 22 tools.py files - docstring hardening: read_file, list_files, view_model, view_simulation_data, generate_results_report, create_measure - fix Windows ? in ndjson log filenames (conftest) Tests: 35 unit + 12 Docker integration + 9 LLM A/B Full LLM regression: 166/172 (96.5%) — no regressions vs Run 7 (97.5%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…sources Curated Ruby snippets showing how to wire coils→loops, terminals→air loops, zone equipment→zones. Covers: four-pipe beam, cooled beam, DOAS, VRF, PTAC, PTHP, fan coils, baseboards, WSHP, plant loop HPs, unitary systems, absorption chillers, setpoint managers, plant/air loop construction. No Docker build change — recipes are Python dicts shipped with code. Completes all 6 items in plan-debug-session-fixes.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- server.py instructions: mention both tools for custom HVAC measures - measure-authoring SKILL.md: "Before Writing HVAC Measures" section - create_measure docstring: TIP to call search_api + search_wiring_patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…checks 26 tests: required fields, no geometry in snippets, search accuracy for all 17 HVAC patterns, Ruby snippet validation (addBranchForZone for terminals, SetpointManager for plant loops, addToThermalZone for zone HVAC). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

3 new tests: HVAC measure authoring checks create_measure + reference tools, search_api method verification, search_wiring_patterns for wiring recipes. Accept alternative tools (get_object_fields, get_skill) as valid — new tools not yet in LLM training data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…erns checks Failures document the tool overload problem (FM1). Tests should fail until the discovery issue is actually solved, not mask it with alternatives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ToolSearch (ENABLE_TOOL_SEARCH=true) cannot find search_api or search_wiring_patterns with any query. Other MCP tools (create_measure, get_object_fields) are discoverable. Need to optimize tool names/descriptions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ools findable Root cause: ToolSearch indexes at Docker build time, not runtime. Volume-mounted new tools were invisible. After rebuild: - search_api: found 1st for "search_api", "SDK methods" - search_wiring_patterns: found 1st for "wiring patterns", "four pipe beam" - recommend_tools: found 1st for "recommend tools" Enriched tool descriptions with use cases, examples, keywords. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…09 pass 7 failures all known flaky. replace_windows_L1 new flaky — agent called search_api (discovered the new tool!) but didn't call replace_window_constructions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

5 options analyzed: consolidate to ~80, split into 4 servers, FastMCP mount, hybrid, or enrich descriptions only. Recommends phased approach: enrich descriptions first, consolidate typed tools second, split only if needed for Cursor/client compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion Profile flag on single entry point/image registers subset of skills (~35 each). Shared model state via /runs volume + auto-save. Includes client limits research, Docker considerations, testing strategy, and all citations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ools History shows we compressed descriptions 30% (a78d308) to reduce context, then ToolSearch made that counterproductive. Enriched descriptions proven to work (search_api went from invisible to 1st result). Plan: restore keyword-rich descriptions for 85 tools without restoring bloat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

7 lessons learned from MCP tool discovery at scale (62→142 tools): description compression was counterproductive (ToolSearch existed but we didn't know), tags are inert, typed tools beat generic for discovery, server instructions are the biggest lever (44%→83%), progressive tests reveal structural limits. Full timeline, metrics, PR history, citations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

22 tools.py files to enrich, README client compatibility section, tool count updates, new min description length test. Recover keywords from pre-compression commit (a78d308). No tool removal or architecture changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

22 tools.py files enriched with domain keywords, use cases, field lists. Reverses description compression (a78d308) that hurt ToolSearch matching. All tools now have ≥40 char first-line descriptions. README: tool count 134→142, add client compatibility table (Cursor not compatible at 40-tool cap). CLAUDE.md + server.py: count 138→142. New test: test_min_description_length enforces ≥40 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All 142 tool descriptions enriched, min 40-char first line enforced, README/CLAUDE.md/server.py updated, client compatibility table added. 11/12 LLM tests pass. search_api + search_wiring_patterns discoverable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same 7 known flaky failures. No regression from enriching all 142 tool descriptions. Confirms description changes are safe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ferences tool-workflows: add HVAC measure verification step add-hvac: add custom HVAC wiring section troubleshoot: add SDK method verification section Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tool sections now sum to 142. Added: validate_model, extract_simulation_errors, list_output_variables, compare_runs, list_weather_files, search_api, search_wiring_patterns, recommend_tools. Added /troubleshoot to skills table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…emphasis Audit: 82% of tools have no when-to-use guidance, 93% no negative scope, only 3 use emphasis keywords. Plan covers 142 tools across 4 tiers with specific confusion pairs, L1 failure analysis, and emphasis targets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pre-ToolSearch guidance (2024) said "extremely detailed." Post-ToolSearch guidance (2025) says "semantic keywords." These conflict — verbose descriptions may hurt ToolSearch discovery. Target only confusion pairs (16), L1 failures (7), bypass-prone (8), shortest (12). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ovement Before/after benchmark: 11/15 (73.3%) both runs. 8/8 confusion pairs pass. 4 L1 failures unchanged — structural prompt ambiguity, not description quality. Added when-to-use, negative scope, emphasis to ~35 tools. Agent's alternative tool choices are reasonable for vague prompts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tool routing, API reference, wiring recipes, description enrichment

- run_qaqc_L1: accept validate_model (correct for pre-sim check) - list_dynamic_type_L1: accept get_sizing_*_properties (more specific) - replace_windows_L1: accept list_common_measures, materials exploration - check_loads_L1: accept get_space_details (contains loads) Remove 4 from FLAKY_TESTS — expanded expected sets make them stable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openstudio-patterns: replace create_example/baseline_osm with create_new_building as primary, add validate_model + search_api/search_wiring_patterns sections new-building: manual workflow uses create_bar_building not create_example_osm qaqc: add validate_model as first step energy-report: add generate_results_report, compare_runs, extract_simulation_errors retrofit: add compare_runs tool for step 5 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ured Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… passes Codex-reviewed all 67 test files against testing rules. Fixed: - 8 critical: unfalsifiable tests (isinstance(ok,bool), conditional silent pass) - 17 high: tautological/existence-only assertions, missing value checks - 17 medium: weak error paths, missing payload validation - 4 additional findings from plan review (test_add_ev_load, test_add_zone_ventilation, test_list_files_items, test_get_air_loop_details) Key changes: - test_hvac.py: use baseline+System7 model so HVAC is guaranteed, remove if-guards - test_common_measures.py: replace isinstance(ok,bool) with ok=True+validate or skip/fail - test_path_safety.py: monkeypatch Popen for deterministic staging, unconditional asserts - test_component_controls.py: fix SPM lookup (was searching wrong dict level + wrong type) - test_building.py: NaN/Inf guard now uses math.isfinite not isinstance - test_skill_retrofit.py: actually compares baseline vs retrofit energy metrics - All error-path tests: assert error message content, not just key existence Also: test_replace_window_constructions now filters for window constructions, test_set_setpoint_min_max_temp adapts properties to actual SPM type found. 2 thermostat tests still skip — genuine tool bug (Choice-type args passed as String in OSW), tracked in #40. 270 passed, 3 skipped, 0 failed in Docker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause: wrappers passed schedule names to measures without checking they match the measure's Choice list filter (e.g. Temperature unitType). OSW runner error "type String while Choice was expected" actually means value not in Choice list — misleading. - _resolve_handle → _resolve_choice_name (returns nameString) - add _validate_schedule: checks exists, is Schedule, has type limits, optionally validates unitType - thermostat wrappers: reject non-Temperature schedules with clear error - add_zone_ventilation: require schedule_name (measure vent_sch mandatory) - tests: use correct Temperature-type schedules, remove lenient skips - 2 new tests: bad schedule type + missing schedule validation 25/25 test_common_measures pass (was 22 pass + 3 skip) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Audit found tests passing despite tool issues — same pattern as #40 where wrong schedule type was masked by lenient skips. - test_common_measures: tautological >= → >, ok-only → runner_messages check, if ok: pass → assert result=="Success", remove unfalsifiable LifeCycleCost readbacks (unsupported type) - test_response_sizes: fixture if-guards → asserts (except air_loops which needs HVAC), skip-on-low-count → assert, tautological <= → value - test_hvac: >= 0 tautological → >= 1, remove redundant isinstance - test_hvac_validation: System 8 HW loop exists (not absent), PFP terminals == len(zones) not > 0, unit heaters >= len(zones) - test_component_controls: if prop in changes → assert prop in changes 149 pass, 2 skip (add_pv_to_shading: no shading surfaces in baseline, filter_zones_by_air_loop: fixture has no HVAC) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Post #40 fix + test audit. 7 previously-flaky L1s now passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove 5 conditional silent-pass patterns in test_path_safety - Add pytest.approx() to 8 bare float comparisons across 6 files - Replace 20 `is not None` checks with non-empty string assertions - Add independent readback assertions for 5 tautological echo tests - Add @pytest.mark.integration to test_validate_model - Move SDK-dependent test from unit file to test_object_management Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Title: Harden test suite + fix Choice-type measure arg validation

…tries default 0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

OSMCP_CODE_MODE env var gates fastmcp CodeMode transform; off by default. Runner/conftest detect code_mode_active, track tools called inside execute blocks, and report CodeMode ON/OFF in benchmark md. Bump fastmcp>=3.1.0 for experimental transform. Includes A/B sweep data (off=95.3%, on=24.0%) and root-cause writeup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move llm-test-benchmark, testing, frameworks-summary, benchmark- description-guidance under docs/testing/. Add README.md as technical report with 7 embedded plots (run history, tier pass rates, progressive L1/L2/L3, token profile, failure modes, cross-model sweep, codemode A/B) + paragraph explanations and legends. Include generate_plots.py + march/april sweep data it sources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New research notes under docs/knowledge/: architecture/testing patterns, MCP best-practices gap analysis, reddit discovery thread, APS agent paper, tool-discovery/LLM-testing writeup. Move geometry research into knowledge/. Drop stale development-process-findings and tool-discovery-research. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Redirect C-level stdout (fd 1) to stderr once at startup, give Python sys.stdout a private fd to the real MCP client pipe. Catches ALL C-level pollution (SWIG GC, Polyhedron geometry, future unknowns) with zero races and no per-callsite wrappers. Fixes concurrent tool timeout (issue #42) and Polyhedron stdout leak on complex models (test_complex_model_stdout_purity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge optimize → develop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

brianlball and others added 30 commits March 15, 2026 08:47

rebalance CI shards: move hvac_validation from shard 2 to 5

a411295

Shard 2 was bottleneck at 9:10, shard 5 idle at 1:04. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add list_weather_files to EXPECTED_TOOLS registry

4523f12

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

move test_bar_building (221s) from shard 2 to shard 5

493b005

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add testing frameworks summary doc

2906448

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

archive completed tool routing plan

eccf3aa

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix FakeMCP.tool() missing **kwargs in test_skill_docs

e982cdd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

archive completed debug session fixes plan (all 6 items done)

e8b022a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update research doc: problem resolved, LLM tests 12/12 pass

d5faba5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update benchmark: Run 12 — 163/170 (95.9%) post description enrichment

6194520

Same 7 known flaky failures. No regression from enriching all 142 tool descriptions. Confirms description changes are safe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update docs with research references, remove stale plan

00f595d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

brianlball and others added 23 commits March 20, 2026 14:46

Merge pull request #39 from NatLabRockies/optimize

e587ba4

Tool routing, API reference, wiring recipes, description enrichment

archive description guidance plan — completed, no L1 improvement meas…

9062bfd

…ured Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add plan: remote multi-user MCP server via Streamable HTTP

86c0e92

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update LLM benchmark: Run 13 — 160/167 passed (95.8%)

e84a765

Post #40 fix + test audit. 7 previously-flaky L1s now passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #41 from NatLabRockies/optimize

e4afb43

Title: Harden test suite + fix Choice-type measure arg validation

improve LLM benchmark: failure mode analysis, ToolSearch overhead, re…

b560e6e

…tries default 0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #46 from NatLabRockies/optimize

a517102

Merge optimize → develop

bump version 0.8.2 → 0.9.0, add CHANGELOG.md

8ae7c7a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

brianlball merged commit 7ec508c into main Apr 10, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release v0.9.0#47

Release v0.9.0#47
brianlball merged 53 commits into
mainfrom
develop

brianlball commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brianlball commented Apr 10, 2026

Release v0.9.0

Highlights

New skills & tools

Bug fixes

Testing

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant