
Conversation


finbarrtimbers commented Oct 22, 2025

Previously, we would occasionally get MFU/MBU values >100%.

Fixes #1098. Also adds logging so we can reproduce the calculations if there are future issues with MFU/MBU.

Fixes were:

  1. We now properly account for sliding-window attention.
  2. We now use the A100 80GB memory bandwidth figure instead of the 40GB one.
  3. Fixed a bug in how we accounted for the number of heads: it was taken from the vLLM parallel_config, which is wrong, as that value is already divided by the sharding degree.
  4. We now properly account for multiple devices (see the sketch after this list).
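
To illustrate item 4, here is a minimal sketch of the kind of multi-device normalization involved (the function name and arguments are illustrative, not the actual ModelDims API):

```python
# Hypothetical sketch, not the actual ModelDims.calculate_mfu signature.
def mfu_percent(achieved_flops: float, elapsed_s: float,
                peak_flops_per_device: float, num_devices: int) -> float:
    """MFU as a percentage of the aggregate peak across all participating devices.

    Normalizing against a single device's peak when several devices did the
    work is one way the reported utilization can exceed 100%.
    """
    aggregate_peak = peak_flops_per_device * num_devices
    return 100.0 * achieved_flops / (elapsed_s * aggregate_peak)
```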

Note

Refactors and fixes MFU/MBU computations to account for sliding-window attention and multi-engine setups, updates GPU specs, integrates new APIs in training/benchmark paths, and adds targeted tests with reproduction cases.

  • Metrics/Utilization (core):
    • Add ModelDims.calculate_mfu/mbu, calculate_actor_utilization, calculate_learner_utilization and use them in grpo_fast.py and benchmark_generators.py.
    • Rework FLOPs/memory math to handle sliding-window attention and correct KV head counts; adjust KV read/write and per-layer computations.
    • Implement per-engine memory_bytes(prompt_lengths, num_engines, num_gpus_per_engine, ...) to normalize MBU across parallel engines.
    • Add check_calculation warning with repro JSON when MFU/MBU > 100% (see the sketch after this note).
  • Model/VLLM integration:
    • ModelDims.from_vllm_config pulls heads directly from hf_text_config and captures sliding-window layer counts.
  • GPU specs:
    • Update A100 80GB memory bandwidth to 2.0e12 B/s.
  • Training/Benchmark wiring:
    • Replace inline MFU/MBU math with calls to new ModelDims APIs; derive num_engines/num_gpus_per_engine for correct normalization.
  • Tests & data:
    • Add MBU regression cases in open_instruct/test_data/mbu_reproduction_cases.json.
    • New tests in test_utils.py for FLOPs/memory, MFU/MBU (incl. multi-engine) and from_vllm_config parity.

Written by Cursor Bugbot for commit b839f17.
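
For illustration, a minimal sketch of the check_calculation guard described in the note above (the signature and repro payload fields are assumptions, not the exact implementation):

```python
import json
import logging

logger = logging.getLogger(__name__)


def check_calculation(metric_name: str, value_pct: float, inputs: dict) -> None:
    """Warn with a reproduction payload whenever a utilization metric exceeds 100%.

    The payload fields here are illustrative; the real code logs whatever is
    needed to rerun the MFU/MBU calculation offline.
    """
    if value_pct > 100.0:
        repro = {"metric": metric_name, "value_pct": value_pct, "inputs": inputs}
        logger.warning("%s is above 100%% (%.2f%%); repro JSON: %s",
                       metric_name, value_pct, json.dumps(repro, sort_keys=True))
```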

finbarrtimbers marked this pull request as ready for review October 28, 2025 18:42

finbarrtimbers and others added 8 commits October 29, 2025 13:28
Added back all docstrings and inline comments that were removed during
the sliding window implementation. These comments explain the assumptions,
calculations, and design decisions in the FLOP and memory bandwidth
estimation code.

Changes:
- Restored docstrings for all ModelDims methods (attn_flops, mlp_flops,
  prefill_flops, decode_flops, flops, weight_memory_bytes,
  kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes,
  decode_memory_bytes, memory_bytes)
- Restored inline comments explaining calculation details
- Kept all functionality changes (sliding window support, A100 bandwidth fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Changed attn_flops signature from using a boolean use_sliding_window flag
to accepting the sliding_window value directly as an Optional[int]. This
makes the API cleaner and more explicit.

Changes:
- attn_flops now takes sliding_window: Optional[int] = None instead of
  use_sliding_window: bool = False
- Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle the
  None case elegantly
- Updated all call sites in prefill_flops and decode_flops to pass
  sliding_window=None for full attention layers and
  sliding_window=self.sliding_window for sliding window layers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
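
A minimal sketch of the resulting API (the FLOP arithmetic and the flattened num_heads/head_dim parameters are illustrative; only the sliding-window clamp is quoted from the commit message above):

```python
from typing import Optional


def attn_flops(query_len: int, kv_len: int, num_heads: int, head_dim: int,
               sliding_window: Optional[int] = None) -> int:
    """Attention FLOPs for one layer; sliding_window=None means full attention.

    With a window, each query token attends to at most `sliding_window` KV
    positions instead of the full context.
    """
    # The clamp from the commit message: None falls back to "no limit".
    kv_len = min(kv_len, sliding_window or float("inf"))
    # 2 FLOPs per MAC for QK^T plus 2 for scores @ V, per head and head_dim element.
    return int(4 * query_len * kv_len * num_heads * head_dim)
```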
Review excerpt:

    prefill_flops = model_dims.flops([sequence_length], None)
    decode_flops = total_flops - prefill_flops
    decode_flops_in_gflops = decode_flops / 1e9
    self.assertAlmostEqual(decode_flops_in_gflops, 27.92, delta=0.01)
finbarrtimbers (Collaborator, Author) commented on the excerpt above:

This comes from some sanity checking that I did manually for the olmo3 paper.

Review excerpt:

    total_bytes *= 2

    memory_in_gb = total_bytes / 1e9
    self.assertAlmostEqual(memory_in_gb, 3.926, delta=0.01)
finbarrtimbers (Collaborator, Author) commented on the excerpt above:

This comes from some sanity checking that I did manually for the olmo3 paper.



Development

Successfully merging this pull request may close these issues.

Unsure: actor MBU might have a bug
