Merged
53 commits
a12a3ae
harden list_files to /inputs+/runs only, add list_weather_files tool
brianlball Mar 15, 2026
a411295
rebalance CI shards: move hvac_validation from shard 2 to 5
brianlball Mar 15, 2026
4523f12
add list_weather_files to EXPECTED_TOOLS registry
brianlball Mar 15, 2026
493b005
move test_bar_building (221s) from shard 2 to shard 5
brianlball Mar 15, 2026
dcc52ec
add parameterized measure quality tests + fix Agent in BUILTIN_TOOLS
brianlball Mar 15, 2026
7e79c7c
fix measure authoring bugs + add agent guardrails against tool bypass
brianlball Mar 16, 2026
a58f2a0
fix debug session issues #1-4: per-fuel compare_runs, climate_zone gu…
brianlball Mar 19, 2026
2906448
add testing frameworks summary doc
brianlball Mar 19, 2026
39d7608
add tool routing: search_api, recommend_tools, tags on all 141 tools,…
brianlball Mar 19, 2026
eccf3aa
archive completed tool routing plan
brianlball Mar 19, 2026
e982cdd
fix FakeMCP.tool() missing **kwargs in test_skill_docs
brianlball Mar 20, 2026
2863d43
add search_wiring_patterns: 24 HVAC wiring recipes from openstudio-re…
brianlball Mar 20, 2026
fbf3338
add LLM discovery hints for search_api + search_wiring_patterns
brianlball Mar 20, 2026
7e327d6
add wiring recipe tests: search accuracy (17 cases) + recipe quality …
brianlball Mar 20, 2026
84ad45b
add LLM tests for search_api + search_wiring_patterns discovery
brianlball Mar 20, 2026
26770dc
revert lenient assertions — keep strict search_api/search_wiring_patt…
brianlball Mar 20, 2026
194b7cd
add tool discovery research + ToolSearch test results
brianlball Mar 20, 2026
c09d6ee
fix tool discovery: Docker rebuild + enriched descriptions make all t…
brianlball Mar 20, 2026
e8b022a
archive completed debug session fixes plan (all 6 items done)
brianlball Mar 20, 2026
d5faba5
update research doc: problem resolved, LLM tests 12/12 pass
brianlball Mar 20, 2026
cdf4243
update benchmark: Run 11 — 164/171 (95.9%) with ToolSearch, all test_…
brianlball Mar 20, 2026
b0bbd41
add tool consolidation plan: descriptions, consolidation, split options
brianlball Mar 20, 2026
ddc0ce0
add deferred plan: multi-MCP server split with profile-based registra…
brianlball Mar 20, 2026
da8a583
rewrite consolidation plan: enrich descriptions, don't remove typed t…
brianlball Mar 20, 2026
c94ef8b
add development process findings for journal article
brianlball Mar 20, 2026
b07e0df
detailed plan: enrich 85 tool descriptions + update README/CLAUDE.md
brianlball Mar 20, 2026
1cd155c
enrich all 142 tool descriptions for ToolSearch discovery + update docs
brianlball Mar 20, 2026
4bba9eb
archive tool consolidation plan (description enrichment complete)
brianlball Mar 20, 2026
6194520
update benchmark: Run 12 — 163/170 (95.9%) post description enrichment
brianlball Mar 20, 2026
00f595d
update docs with research references, remove stale plan
brianlball Mar 20, 2026
f717076
update Claude Code skills with search_api + search_wiring_patterns re…
brianlball Mar 20, 2026
0a83151
fix README: add 8 missing tools, add /troubleshoot skill, update counts
brianlball Mar 20, 2026
780b58e
plan: tool description usage guidance — when-to-use, negative scope, …
brianlball Mar 20, 2026
b6b0027
revise plan: targeted guidance on ~35 tools, not all 142
brianlball Mar 20, 2026
ed3635e
add confusion pair tests + targeted description guidance — no L1 impr…
brianlball Mar 20, 2026
e587ba4
Merge pull request #39 from NatLabRockies/optimize
brianlball Mar 20, 2026
b27a24d
fix 4 L1 test expectations: accept agent's reasonable alternative tools
brianlball Mar 20, 2026
f7e226b
fix skills: correct tool recommendations, add missing tools
brianlball Mar 20, 2026
9062bfd
archive description guidance plan — completed, no L1 improvement meas…
brianlball Mar 20, 2026
86c0e92
add plan: remote multi-user MCP server via Streamable HTTP
brianlball Mar 22, 2026
ace9ec8
fix tests: 46 quality findings — strengthen assertions, remove silent…
brianlball Mar 26, 2026
b6232ff
fix #40: validate Choice-type measure args in wrappers
brianlball Mar 26, 2026
55c698d
fix tests: remove patterns that hide MCP method bugs
brianlball Mar 26, 2026
e84a765
update LLM benchmark: Run 13 — 160/167 passed (95.8%)
brianlball Mar 27, 2026
e572961
fix tests: 15 rule violations from Codex review
brianlball Mar 27, 2026
e4afb43
Merge pull request #41 from NatLabRockies/optimize
brianlball Mar 27, 2026
b560e6e
improve LLM benchmark: failure mode analysis, ToolSearch overhead, re…
brianlball Apr 5, 2026
3679b57
add CodeMode toggle (default off) + LLM harness support
brianlball Apr 6, 2026
7878a0b
reorganize testing docs into docs/testing/ + technical report
brianlball Apr 6, 2026
14a77d6
add knowledge docs: MCP research, best-practices gap, tool discovery
brianlball Apr 6, 2026
2395d95
fix: permanent fd redirect for stdout suppression (issue #42)
brianlball Apr 10, 2026
a517102
Merge pull request #46 from NatLabRockies/optimize
brianlball Apr 10, 2026
8ae7c7a
bump version 0.8.2 → 0.9.0, add CHANGELOG.md
brianlball Apr 10, 2026
8 changes: 8 additions & 0 deletions .claude/skills/add-hvac/SKILL.md
@@ -51,6 +51,14 @@ Guide the user through selecting and applying an HVAC system to their model.

6. Report what was created: system name, zones served, equipment types, plant loops.

## Custom HVAC Wiring

For custom HVAC configurations beyond the baseline templates:
```
search_wiring_patterns("DOAS") # get working Ruby wiring code
search_api("CoilCoolingFourPipeBeam") # verify SDK method names
```

## Notes

- Get all zone names from `list_thermal_zones()` — names must match exactly
15 changes: 13 additions & 2 deletions .claude/skills/energy-report/SKILL.md
@@ -12,17 +12,28 @@ Extract all result categories from a completed simulation and present a structur

1. Identify the run. If user provides a run_id, use it. Otherwise check for the most recent simulation.

2. Extract all result categories:
2. For an HTML report with ~25 sections (fastest):
```
generate_results_report(run_id=<id>)
```

3. Or extract individual categories for custom analysis:
```
extract_summary_metrics(run_id=<id>)
extract_end_use_breakdown(run_id=<id>)
extract_envelope_summary(run_id=<id>)
extract_hvac_sizing(run_id=<id>)
extract_zone_summary(run_id=<id>)
extract_component_sizing(run_id=<id>)
extract_simulation_errors(run_id=<id>)
```

4. For before/after comparison:
```
compare_runs(baseline_run_id=<id1>, retrofit_run_id=<id2>)
```

3. Optionally run QA/QC:
5. Optionally run QA/QC:
```
run_qaqc_checks()
```
8 changes: 4 additions & 4 deletions .claude/skills/new-building/SKILL.md
@@ -68,13 +68,13 @@ Step 3 — Typical building (same as Workflow B step 4)

For fully custom buildings not matching DOE prototypes:

1. `create_example_osm(name="<name>")` or `create_baseline_osm(name="<name>")`
2. Create geometry with `create_space_from_floor_print` + `match_surfaces`
1. `load_osm_model` an empty model, or start with `create_bar_building` for basic geometry
2. Create/refine geometry with `create_space_from_floor_print` + `match_surfaces`
3. Add glazing with `set_window_to_wall_ratio`
4. Create materials/constructions/loads manually
5. Add HVAC with `add_baseline_system`
6. Set weather + design days
7. Simulate
6. Set weather with `change_building_location`
7. Check with `validate_model`, then simulate

## Simulation

30 changes: 24 additions & 6 deletions .claude/skills/openstudio-patterns/SKILL.md
@@ -32,7 +32,7 @@ Weather (EPW + design days, needed before simulation)

## Typical Model Build Order

1. **Create or load model** — `create_example_osm` / `create_baseline_osm` / `load_osm_model`
1. **Create or load model** — `create_new_building` (recommended) / `load_osm_model` / `create_bar_building`
2. **Geometry** — `create_space_from_floor_print` (preferred) or `create_space` + `create_surface`
3. **Match surfaces** — `match_surfaces` after all spaces created (finds shared walls)
4. **Thermal zones** — `create_thermal_zone` with `space_names`
@@ -84,11 +84,29 @@ Weather (EPW + design days, needed before simulation)

| Goal | Tool | Notes |
|------|------|-------|
| Quick test model (1 zone) | `create_example_osm` | Minimal geometry, no HVAC |
| Baseline with HVAC (10 zones) | `create_baseline_osm` | Includes ASHRAE system, geometry, schedules |
| Custom geometry | `create_space_from_floor_print` | Preferred — auto-creates walls, floor, ceiling from polygon |
| Explicit surfaces | `create_surface` | Use only when floor print extrusion won't work |
| Typical building (standards-based) | `create_typical_building` | ComStock measure, adds constructions + loads + HVAC + schedules |
| Production building model | `create_new_building` | End-to-end: geometry + weather + HVAC + loads. Recommended starting point. |
| Custom geometry only | `create_bar_building` | Bar geometry from building type/area. Follow with `create_typical_building` for loads+HVAC. |
| Custom floor plan | `create_space_from_floor_print` | Extrude polygon into 3D space. Use for non-rectangular geometry. |
| Standards template on existing geometry | `create_typical_building` | Adds constructions + loads + HVAC + schedules to model with geometry. |
| Import from FloorSpaceJS | `import_floorspacejs` | Load custom geometry JSON, then `create_typical_building` for loads+HVAC. |
| Quick test (1 zone, no HVAC) | `create_example_osm` | Testing/demos only. |
| Baseline test (10 zones) | `create_baseline_osm` | Testing/demos only. |

## Pre-Simulation Checklist

Before `run_simulation`, call `validate_model` to verify:
- Weather file set (EPW)
- Design days present (from DDY)
- HVAC assigned to zones
- Constructions on surfaces

## HVAC Measure Authoring

Before writing measures that create HVAC objects:
```
search_api("CoilCoolingFourPipeBeam") # verify real method names
search_wiring_patterns("four pipe beam") # get working connection code
```

## Common Error Patterns

12 changes: 9 additions & 3 deletions .claude/skills/qaqc/SKILL.md
@@ -9,13 +9,19 @@ Inspect the current model for common issues before running a simulation.

## Steps

1. Get model overview:
1. Quick automated check:
```
validate_model()
```
Checks weather, design days, HVAC, constructions in one call.

2. Get model overview:
```
inspect_osm_summary()
get_model_summary()
get_building_info()
```

2. Check for missing critical elements:
3. Check for missing critical elements:
- **Zones without HVAC:** `list_thermal_zones()` — look for zones with no equipment
- **Spaces without zones:** `list_spaces()` — look for spaces not assigned to a thermal zone
- **Missing constructions:** `list_surfaces()` — look for surfaces without constructions
9 changes: 5 additions & 4 deletions .claude/skills/retrofit/SKILL.md
@@ -65,10 +65,11 @@ extract_end_use_breakdown(run_id=<retrofit_id>)
```

### 5. Compare Results
Present side-by-side comparison:
- EUI change (absolute and percentage)
- End-use breakdown delta (which categories improved)
- Unmet hours change (ensure comfort wasn't sacrificed)
```
compare_runs(baseline_run_id=<baseline_id>, retrofit_run_id=<retrofit_id>)
```
Returns EUI delta, per-fuel end-use breakdown, and unmet hours change.
For manual comparison, use `extract_summary_metrics` on both runs.
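
For the manual route, the EUI change (absolute and percentage) is simple arithmetic on the two summary values. A minimal sketch, assuming both EUIs have already been read from the two runs and share the same units; the function name `eui_change` and the returned keys are hypothetical and are not part of the tool's response format:

```python
def eui_change(baseline_eui, retrofit_eui):
    """Return the absolute and percentage EUI change relative to the baseline.

    Hypothetical helper: both inputs must use the same units (e.g. MJ/m2).
    Negative values mean the retrofit uses less energy than the baseline.
    """
    if baseline_eui == 0:
        raise ValueError("baseline EUI must be non-zero")
    delta = retrofit_eui - baseline_eui
    percent = 100.0 * delta / baseline_eui
    return {"absolute": delta, "percent": percent}
```

A baseline of 100.0 and a retrofit of 80.0 gives an absolute change of -20.0 and a percentage change of -20.0.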

## Notes

6 changes: 6 additions & 0 deletions .claude/skills/tool-workflows/SKILL.md
@@ -120,6 +120,12 @@ extract_summary_metrics(run_id=<retrofit_id>)

See the `measure-authoring` skill for run_body patterns and language guidance.

For HVAC measures, verify methods exist and get wiring code first:
```
search_api("CoilCoolingFourPipeBeam") # check real setter/getter names
search_wiring_patterns("four pipe beam") # get working Ruby wiring code
```

## Write and Apply a Custom ReportingMeasure

ReportingMeasures run after simulation to analyze SQL results.
8 changes: 8 additions & 0 deletions .claude/skills/troubleshoot/SKILL.md
@@ -58,6 +58,14 @@ query_timeseries(run_id=..., variable_name="Zone Mean Air Temperature",
frequency="Hourly", key_value="Zone 1")
```

## Verify SDK Methods

If a measure fails due to nonexistent API methods:
```
search_api("CoilCoolingFourPipeBeam") # list real setters/getters
search_api("BoilerHotWater", method_pattern="Efficiency")
```

## Quick Fixes

| Problem | Tool |
12 changes: 6 additions & 6 deletions .github/workflows/ci.yml
@@ -62,17 +62,17 @@ jobs:
case ${{ matrix.shard }} in
1)
# sim test + component/weather + loop ops + skill_retrofit
FILES="tests/test_example_workflows.py tests/test_component_properties.py tests/test_comstock.py tests/test_weather.py tests/test_mcp_seb4.py tests/test_create_constructions.py tests/test_loop_operations.py tests/test_plant_loop_demand.py tests/test_sizing_properties.py tests/test_skill_retrofit.py tests/test_integration.py"
FILES="tests/test_example_workflows.py tests/test_component_properties.py tests/test_comstock.py tests/test_weather.py tests/test_weather_files.py tests/test_mcp_seb4.py tests/test_create_constructions.py tests/test_loop_operations.py tests/test_plant_loop_demand.py tests/test_sizing_properties.py tests/test_skill_retrofit.py tests/test_integration.py"
EXTRA_ENV="-e MCP_OSW_PATH=tests/assets/SEB_model/SEB4_baseboard/workflow.osw -e EXPECTED_EUI=1.8750760248144998 -e EXPECTED_EUI_RTOL=0.02 -e EXPECTED_EUI_ATOL=0.0"
;;
2)
# common_measures, hvac_systems, geometry, zone terminal, skill_energy_report, hvac_validation (consolidated)
FILES="tests/test_common_measures.py tests/test_hvac_systems.py tests/test_replace_zone_terminal.py tests/test_geometry.py tests/test_bar_building.py tests/test_skill_energy_report.py tests/test_hvac_validation.py"
# common_measures, hvac_systems, geometry, zone terminal, skill_energy_report
FILES="tests/test_common_measures.py tests/test_hvac_systems.py tests/test_replace_zone_terminal.py tests/test_geometry.py tests/test_skill_energy_report.py"
EXTRA_ENV=""
;;
3)
# controls, object mgmt, loads, building, doas, hvac, measures, measure_authoring, skill_qaqc, hvac_supply_wiring
FILES="tests/test_component_controls.py tests/test_object_management.py tests/test_generic_access.py tests/test_create_loads.py tests/test_building.py tests/test_doas_system.py tests/test_hvac.py tests/test_measures.py tests/test_measure_authoring.py tests/test_skill_qaqc.py tests/test_hvac_supply_wiring.py tests/test_validate_model.py"
FILES="tests/test_component_controls.py tests/test_object_management.py tests/test_generic_access.py tests/test_create_loads.py tests/test_building.py tests/test_doas_system.py tests/test_hvac.py tests/test_measures.py tests/test_measure_authoring.py tests/test_skill_qaqc.py tests/test_hvac_supply_wiring.py tests/test_validate_model.py tests/test_api_reference.py"
EXTRA_ENV=""
;;
4)
@@ -81,8 +81,8 @@ jobs:
EXTRA_ENV=""
;;
5)
# HVAC supply wiring simulation smoke tests (DOAS, radiant, district, beams)
FILES="tests/test_hvac_supply_sim.py"
# HVAC supply sim smoke tests + hvac_validation + bar_building + concurrent regression
FILES="tests/test_hvac_supply_sim.py tests/test_hvac_validation.py tests/test_bar_building.py tests/test_concurrent_tools.py"
EXTRA_ENV=""
;;
esac
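
Shard 1 passes `EXPECTED_EUI`, `EXPECTED_EUI_RTOL` and `EXPECTED_EUI_ATOL` to the tests through `EXTRA_ENV`. How the test consumes them is not shown in this diff. One common convention, the one used by `numpy.isclose`, accepts a value when `|actual - expected| <= atol + rtol * |expected|`. A sketch under that assumption (the function name is hypothetical):

```python
def eui_within_tolerance(actual, expected, rtol=0.02, atol=0.0):
    """Check an EUI against its expected value with combined tolerances.

    Assumed semantics (numpy.isclose-style), not confirmed by the diff:
    pass when |actual - expected| <= atol + rtol * |expected|.
    """
    return abs(actual - expected) <= atol + rtol * abs(expected)
```

With the values in the workflow (expected 1.875..., rtol 0.02, atol 0.0), this convention would allow a deviation of roughly 0.0375 in either direction.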
80 changes: 80 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,80 @@
# Changelog

## [0.9.0] - 2026-04-10

### Added
- **Geometry tools**: `create_bar_building`, `create_new_building`, `import_floorspacejs` for model creation from DOE prototypes and FloorSpaceJS JSON
- **Generic object access**: `get_object_fields`, `set_object_property`, dynamic `list_model_objects` for any OpenStudio type
- **Measure authoring skill**: `create_measure`, `edit_measure`, `test_measure` with ReportingMeasure support
- **Tool routing**: `search_api` (OpenStudio SDK search), `recommend_tools`, `search_wiring_patterns` (24 HVAC wiring recipes)
- **HVAC components**: FourPipeBeam and CooledBeam air terminals, `set_zone_equipment_priority`
- **LLM test suite**: 170+ tests across 5 tiers with progressive difficulty (L1 vague / L2 moderate / L3 explicit), cross-model benchmark sweeps (sonnet/opus/haiku), CodeMode A/B comparison
- **Concurrent tool regression test**: validates MCP responses under concurrent tool calls
- **Stdout purity test**: validates no C-level pollution on complex 44-zone models
- **Response-size guardrails**: `max_results` + filters on all list tools, brief mode for large responses
- **Agent guardrails**: anti-loop instructions in MCP server, tool-bypass prevention
- Tags on all 142 tools for ToolSearch discovery
- Enriched tool descriptions for better LLM tool selection
- `list_weather_files` tool, `validate_model` tool, `extract_simulation_errors` tool
- `compare_runs` tool for two-simulation comparison
- CI expanded to 5 shards, ~450+ integration tests

### Fixed
- **Concurrent tool timeout (issue #42)**: permanent fd redirect replaces racy global middleware — C stdout goes to stderr once at startup, Python sys.stdout gets private fd to MCP client
- **Polyhedron stdout leak**: OpenStudio geometry engine C++ diagnostics no longer corrupt JSON-RPC stream
- SWIG memory leak warnings fully suppressed across all callsites
- Measure XML stale checksums causing OS App rejection
- Choice-type measure argument validation in wrappers
- JSON-string list params across 9 affected tools (`parse_str_list()`)
- `conditioned_floor_area` computed from model instead of hardcoded
- EUI units now report MJ/m2 + kBtu/ft2 alongside GJ/m2
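
The fd redirect described in the first fix can be illustrated in outline. This is a minimal sketch of the general technique only, not the project's code: the function name `make_private_stdout` and its parameters are hypothetical. It duplicates the original stdout descriptor for Python's exclusive use, then points the public descriptor at stderr so that C-level writes can no longer reach the JSON-RPC stream.

```python
import os


def make_private_stdout(stdout_fd=1, stderr_fd=2):
    """Return a text stream on a private copy of stdout_fd.

    Hypothetical sketch of a one-time redirect: after this call, anything
    written directly to stdout_fd (for example C/C++ printf output) is
    delivered to stderr_fd instead.
    """
    # Keep a private duplicate of the original stdout (the pipe to the client).
    private_fd = os.dup(stdout_fd)
    # Make stdout_fd an alias of stderr_fd; this is done once, not per call.
    os.dup2(stderr_fd, stdout_fd)
    # Line-buffered text stream that only holders of this object can write to.
    return os.fdopen(private_fd, "w", buffering=1, encoding="utf-8")
```

At startup a server would call this once and assign the result to `sys.stdout`. Because the redirect is never toggled afterwards, concurrent tool calls have no shared state to race on, which appears to be the point of replacing the per-call middleware.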

### Changed
- `list_files` hardened to `/inputs` + `/runs` only
- `change_building_location` preferred over `set_weather_file` (sets EPW+DDY+CZ in one call)
- Consolidated 4 HVAC validation test files into single `test_hvac_validation.py`
- Consolidated integration tests: -8 files, -57 Docker sessions

## [0.8.2] - 2026-03-28

### Added
- Tool description enrichment for all 142 tools
- CodeMode toggle (default off) with LLM harness support

## [0.8.0] - 2026-03-13

### Added
- Measure authoring skill with test framework
- SWIG stdout suppression middleware (replaced in 0.9.0)
- Phase 10 results tools: `extract_simulation_errors`, `list_output_variables`, `compare_runs`

## [0.7.0] - 2026-03-07

### Added
- LLM agent test suite (170+ tests, local-only)
- Geometry workflows (FloorSpaceJS import, bar building)

## [0.6.0] - 2026-02-28

### Added
- Response-size guardrails on all list tools
- Generic object access (Phase C)

## [0.5.0] - 2026-02-21

### Added
- Agent guardrails (anti-loop, tool-bypass prevention)
- Weather file improvements

## [0.4.0] - 2026-02-14

### Added
- Common measures integration (20 measures, 11 wrapper tools)
- Context reduction (auto-load, brief mode, batch removal)

## [0.3.0] - 2026-02-07

### Added
- Initial skills architecture (22 skills, 126 tools)
- 5-shard CI pipeline
- OpenStudio SDK 3.11.0 integration
25 changes: 13 additions & 12 deletions CLAUDE.md
@@ -1,9 +1,9 @@
# CLAUDE.md — Instructions for Claude Code

always be brutally honest
## Project: openstudio-mcp
MCP server giving AI agents full control of building energy modeling —
create buildings, author measures, configure HVAC, run EnergyPlus sims, extract
results — all through 138 MCP tools backed by the OpenStudio SDK.
results — all through 142 MCP tools backed by the OpenStudio SDK.

## Critical: Use MCP Tools — Do Not Reinvent
Always use openstudio-mcp tools for BEM tasks:
@@ -17,15 +17,16 @@ Always use openstudio-mcp tools for BEM tasks:
1. Keep files under ~250 lines — don't split artificially just to hit a number
2. Every MCP tool must have an integration test. New behavior, bug fixes, and security hardening need tests too — not just the happy path
3. Integration tests must be added to `.github/workflows/ci.yml` — append to the lightest shard's `FILES=` list (5 shards, keep balanced ~200s each)
4. Operations return `{"ok": True/False, ...}` — never raise through MCP
5. Use `openstudio` Python bindings directly
6. All OpenStudio attribute access must handle `is_initialized()` checks
7. `_extract_*` functions return dicts with `snake_case` keys matching OpenStudio attribute names
8. Tool functions keep `_tool` suffix internally; MCP-visible names strip it via `@mcp.tool(name="...")`
9. Never commit generated/temp files — `.gitignore` covers `__pycache__/`, `*.pyc`, `runs/`, `.claude/`, `.pytest_cache/`. Test artifacts go to `runs/`. Only permanent reference models go in `tests/assets/`
10. Bundled measures get wrapper tools with typed args — don't expose raw `apply_measure` as primary interface
11. No `getattr()` or string-based dispatch — every OpenStudio API method called directly (grepable, lintable, visible in stack traces)
12. MCP clients may send `list[str]` as JSON strings — use `list[str] | str` type annotation + `parse_str_list()` from `osm_helpers.py`
4. Follow testing rules in `.claude/rules/testing.md`. Critical: every test needs `# Regression:` or `# Validates:` comment; never delete failing tests or weaken assertions; assert exact values not existence; integration tests mock nothing; unit tests never import `openstudio`
5. Operations return `{"ok": True/False, ...}` — never raise through MCP
6. Use `openstudio` Python bindings directly
7. All OpenStudio attribute access must handle `is_initialized()` checks
8. `_extract_*` functions return dicts with `snake_case` keys matching OpenStudio attribute names
9. Tool functions keep `_tool` suffix internally; MCP-visible names strip it via `@mcp.tool(name="...")`
10. Never commit generated/temp files — `.gitignore` covers `__pycache__/`, `*.pyc`, `runs/`, `.claude/`, `.pytest_cache/`. Test artifacts go to `runs/`. Only permanent reference models go in `tests/assets/`
11. Bundled measures get wrapper tools with typed args — don't expose raw `apply_measure` as primary interface
12. No `getattr()` or string-based dispatch — every OpenStudio API method called directly (grepable, lintable, visible in stack traces)
13. MCP clients may send `list[str]` as JSON strings — use `list[str] | str` type annotation + `parse_str_list()` from `osm_helpers.py`

## Architecture
- Each skill lives in `mcp_server/skills/<name>/`
@@ -72,7 +73,7 @@ docker run --rm \
- Targeted: `LLM_TESTS_ENABLED=1 pytest tests/llm/test_06_progressive.py -k "thermostat_L1" -v`
- Full suite only for final validation
- Markers: `-m smoke` (12), `-m generic` (10), `-m progressive` (102)
- Benchmark results go in `docs/llm-test-benchmark.md`
- Benchmark results go in `docs/testing/llm-test-benchmark.md`

### Local Development
- Lint: `ruff check mcp_server/`