Releases: Hmbown/Butterfly
Wayfinder v0.1.0 (GLM-first public release)
First Public Release (2026-02-18)
This release ships one safe default path and separates validated behavior from experimental and known-regression slices. The public scope is primarily a GLM-4.7-Flash-4bit stable-path release.
Public Default
- Stable command: ./scripts/run_public_stable_profile_glm.sh
- Model default: mlx-community/GLM-4.7-Flash-4bit
- Runtime defaults: seq_len=8192, decode_len=32, repeats=1
- Conservative flags: --skip-multi-turn --skip-quality
- Retro/backfill inference default remains off.
- Execution policy: run one benchmark/inference command at a time.
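The execution policy above can be enforced mechanically. The following is a minimal sketch, not part of the release scripts: the helper name run_sequentially is hypothetical, and the demo commands are harmless placeholders rather than real benchmark invocations. It runs each command to completion before starting the next and stops at the first nonzero exit.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run commands one at a time; stop after the first nonzero exit.

    `commands` is a list of argv lists. Returns (argv, returncode) pairs
    for every command that was actually started.
    """
    completed = []
    for argv in commands:
        # subprocess.run blocks until the child exits, so two commands
        # never overlap.
        result = subprocess.run(argv)
        completed.append((argv, result.returncode))
        if result.returncode != 0:
            break
    return completed


# Placeholder commands (assumption: swap in the real benchmark commands).
demo = [
    [sys.executable, "-c", "print('step 1')"],
    [sys.executable, "-c", "print('step 2')"],
]
for argv, code in run_sequentially(demo):
    print(code, argv[-1])
```

Stopping at the first failure keeps a broken run from being followed by further benchmark commands on the same machine.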
Support Matrix
| Tier | Status | Scope | Decision |
|---|---|---|---|
| Validated (default) | Recommended | GLM consumer benchmark path + stable wrapper | Keep as default |
| Experimental (opt-in) | Not default | Qwen and Nanbeige diagnostic sweeps | Keep opt-in only |
| Known regression / non-default | Do not default | Nanbeige T=131072, decode_len=256 | Keep non-default |
Validated Evidence (2026-02-18)
Stable wrapper run EXP-20260218T151213Z-STABLE-PROFILE:
- Dense: e2e=17.1473s, prefill=16.3586s, decode=0.7886s, decode_tok_s=40.5762, peak_memory=20,660,500,140
- Wayfinder: e2e=10.5563s, prefill=9.7533s, decode=0.8030s, decode_tok_s=39.8499, peak_memory=20,071,482,232
- Delta (wayfinder vs dense): e2e=-38.44%, prefill=-40.38%, decode=+1.82%, decode_tok_s=-1.79%, peak_memory=-2.85% (wayfinder lower)
- Memory convention: 100*(1-wayfinder/dense)=+2.8509% (reduction)
Artifacts:
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/first_release/EXP-20260218T151213Z-STABLE-PROFILE/dense/results.json
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/first_release/EXP-20260218T151213Z-STABLE-PROFILE/wayfinder/results.json
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/first_release/EXP-20260218T151213Z-STABLE-PROFILE/stable_profile_summary.json
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/first_release/EXP-20260218T151213Z-STABLE-PROFILE/stable_profile_summary.md
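The delta and memory-convention figures in this section can be recomputed from the Dense and Wayfinder values. The sketch below assumes the delta is 100*(wayfinder/dense - 1), which matches the reported signs; the function names are illustrative and do not come from the benchmark scripts. Because the inputs are the rounded values printed above, the last digit can differ from the reported delta (decode recomputes to about +1.83% against the reported +1.82%).

```python
def pct_delta(wayfinder, dense):
    """Signed percent change of wayfinder relative to dense."""
    return 100.0 * (wayfinder / dense - 1.0)


def memory_reduction_pct(wayfinder, dense):
    """Memory convention from the notes: positive means a reduction."""
    return 100.0 * (1.0 - wayfinder / dense)


# Values copied from the stable wrapper run above.
dense = {
    "e2e": 17.1473,
    "prefill": 16.3586,
    "decode": 0.7886,
    "decode_tok_s": 40.5762,
    "peak_memory": 20_660_500_140,
}
wayfinder = {
    "e2e": 10.5563,
    "prefill": 9.7533,
    "decode": 0.8030,
    "decode_tok_s": 39.8499,
    "peak_memory": 20_071_482_232,
}

deltas = {key: pct_delta(wayfinder[key], dense[key]) for key in dense}
for key, value in deltas.items():
    print(f"{key}={value:+.2f}%")

reduction = memory_reduction_pct(wayfinder["peak_memory"], dense["peak_memory"])
print(f"memory_reduction={reduction:+.4f}%")
```

The two memory figures are the same quantity with opposite sign: peak_memory=-2.85% as a delta, +2.8509% as a reduction.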
Experimental Boundary Closure (2026-02-18)
Nanbeige instrumented pair T=131072, decode_len=32:
- Dense: e2e=467.1097s, prefill=435.8845s, decode=31.2253s, decode_tok_s=1.0248, peak_memory=18,460,513,312
- Wayfinder trace: e2e=471.7444s, prefill=425.2292s, decode=46.5151s, decode_tok_s=0.6879, peak_memory=18,474,229,212
- Delta (wayfinder vs dense): e2e=+0.99%, prefill=-2.44%, decode=+48.97%, decode_tok_s=-32.87%, peak_memory=+0.07%
- Memory convention: 100*(1-wayfinder/dense)=-0.0743% (slight increase)
- Fallback diagnostics: path_counts={permute:264, permute_dense_fallback:248}, dense_fallback_reason_counts={active_large_q:248}, dense_fallback_share_run=0.484375
Decision: keep this long-boundary Nanbeige path experimental/non-default.
Artifacts:
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218T151213Z/instrumented_trace32_T131072/results.json
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218T151213Z/instrumented_dense32_T131072/results.json
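The fallback share in the diagnostics above is consistent with dividing the dense-fallback path count by the total path count. The sketch below works under that assumption; the function name and the second check are illustrative, and the actual instrumentation may compute these fields differently.

```python
def dense_fallback_share(path_counts, fallback_key="permute_dense_fallback"):
    """Fraction of counted paths that fell back to dense attention."""
    total = sum(path_counts.values())
    if total == 0:
        return 0.0
    return path_counts.get(fallback_key, 0) / total


# Counts copied from the fallback diagnostics above.
path_counts = {"permute": 264, "permute_dense_fallback": 248}
reason_counts = {"active_large_q": 248}

share = dense_fallback_share(path_counts)
print(share)  # 248 / 512 = 0.484375

# Every fallback should carry a reason; an unexplained remainder would
# trigger the "reason counts missing/unspecified" decision rule.
unexplained = path_counts["permute_dense_fallback"] - sum(reason_counts.values())
print(unexplained)
```

Here all 248 fallbacks are attributed to active_large_q, so the reason counts fully account for the fallback path.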
Known Regression (2026-02-18)
Nanbeige T=131072, decode_len=256 paired instrumented result:
e2e=+44.63%, prefill=+13.26%, decode=+463.08%, decode_tok_s=-82.24%, peak_memory=+51.25%
Decision: non-default, do not promote.
Artifacts:
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218/instrumented_wayfinder256_T131072/results.json
- /Volumes/VIXinSSD/wayfinder/benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218/instrumented_dense256_T131072_rerun/results.json
Exact Reproduction Commands
Run sequentially, one command at a time.
./scripts/verify_install_and_preflight.sh --run-id EXP-YYYYMMDDTHHMMSSZ-VERIFY-INSTALL --out-dir benchmarks/mlx/preflight
python3 scripts/bench_glm_consumer_mlx.py --mode dense --seq-lens 2048 --decode-len 8 --repeats 1 --skip-multi-turn --skip-quality --out-dir benchmarks/mlx/first_release/first_run_dense_t2048
./scripts/run_public_stable_profile_glm.sh
Nanbeige boundary instrumentation pair used for gate closure:
python3 scripts/bench_qwen_consumer_mlx.py --model-path Nanbeige/Nanbeige4.1-3B --mode wayfinder --seq-lens 131072 --decode-len 32 --repeats 1 --skip-multi-turn --skip-quality --hsa-trace --window 64 --head-chunk-size 2 --query-chunk-size 384 --stage-timeout-sec 900 --heartbeat-sec 30 --out-dir benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218T151213Z/instrumented_trace32_T131072
python3 scripts/bench_qwen_consumer_mlx.py --model-path Nanbeige/Nanbeige4.1-3B --mode dense --seq-lens 131072 --decode-len 32 --repeats 1 --skip-multi-turn --skip-quality --stage-timeout-sec 900 --heartbeat-sec 30 --out-dir benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_20260218T151213Z/instrumented_dense32_T131072
Artifact Layout
- Install verify:
  - <out_dir>/<RUN_ID>_env_check_mlx.json
  - <out_dir>/<RUN_ID>_summary.json
  - <out_dir>/<RUN_ID>_raw.txt
- Stable profile:
  - benchmarks/mlx/first_release/<RUN_ID>/dense/results.json
  - benchmarks/mlx/first_release/<RUN_ID>/wayfinder/results.json
  - benchmarks/mlx/first_release/<RUN_ID>/stable_profile_summary.json
  - benchmarks/mlx/first_release/<RUN_ID>/stable_profile_summary.md
- Boundary diagnostics:
  - benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_<TS>/instrumented_trace32_T131072/results.json
  - benchmarks/mlx/nanbeige4_1_3b_wayfinder/hang_debug_<TS>/instrumented_dense32_T131072/results.json
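A quick way to confirm that a stable-profile run produced the layout above is to check for the four expected files. This is a minimal sketch: the helper name missing_stable_artifacts is hypothetical, and only the relative paths are taken from the layout listed here.

```python
from pathlib import Path

# Relative paths under benchmarks/mlx/first_release/<RUN_ID>/, as listed above.
STABLE_PROFILE_FILES = [
    "dense/results.json",
    "wayfinder/results.json",
    "stable_profile_summary.json",
    "stable_profile_summary.md",
]


def missing_stable_artifacts(root, run_id):
    """Return the expected stable-profile files that are absent for run_id."""
    run_dir = Path(root) / run_id
    return [rel for rel in STABLE_PROFILE_FILES if not (run_dir / rel).is_file()]
```

Calling missing_stable_artifacts("benchmarks/mlx/first_release", run_id) after the stable wrapper finishes should return an empty list; any entry it returns corresponds to the missing results.json case in the decision rules.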
Decision Rules
- If the command exits nonzero, results.json is missing, or no usable single_turn row is present: stop and classify as follow-up.
- If fallback is observed and reason counts are missing/unspecified: stop and classify as follow-up.
- If dense-relative regressions are material or unresolved at boundary slices: keep experimental/non-default.
- Only validated slices are eligible for default commands.
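The rules above can be read as an ordered check. The sketch below is one possible encoding, not the project's gate logic: the function name, its arguments, the returned labels, and the assumption that results.json parses to a dict with a "single_turn" list are all hypothetical. Whether a regression is "material" is left to the caller, since the notes give no threshold.

```python
def classify_run(exit_code, results, fallback_observed=False,
                 reason_counts=None, material_regression=False):
    """Apply the decision rules in order and return a label.

    `results` is the parsed results.json (None if the file is missing).
    The "single_turn" key is an assumed schema detail.
    """
    # Rule 1: failed command, missing results, or no usable single_turn row.
    if exit_code != 0 or results is None or not results.get("single_turn"):
        return "follow-up"
    # Rule 2: fallback happened but nothing explains why.
    if fallback_observed and not reason_counts:
        return "follow-up"
    # Rule 3: material or unresolved dense-relative regression.
    if material_regression:
        return "experimental/non-default"
    # Rule 4: only slices that pass every check can be considered for default.
    return "eligible-for-default"


# Example: a clean exit with fallback explained, but a material regression.
label = classify_run(
    exit_code=0,
    results={"single_turn": [{"seq_len": 131072}]},
    fallback_observed=True,
    reason_counts={"active_large_q": 248},
    material_regression=True,
)
print(label)
```

The ordering matters: a run that fails rule 1 or 2 is never evaluated for regressions, because its numbers cannot be trusted.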
Where Wayfinder Helps / Does Not Help
- Helps (validated default path): GLM stable profile at T=8192, decode_len=32, with large e2e/prefill gains (-38.44% and -40.38%) and 2.85% lower peak memory (2026-02-18 run above).
- Does not help (current non-default boundaries): Nanbeige T=131072 long-boundary slices remain decode-limited and fallback-heavy; decode_len=256 is a confirmed regression and decode_len=32 still loses decode throughput.