fix(cuda): ensure streams have observed events before destroying #2242

jonathanpwang · 2025-11-14T02:45:50Z

MemoryMerkleTree was using a very strange pattern with its own stream passing to/from default stream. I don't think this is necessary since most of the work is done on subtree streams and the finalize kernel can be on the default stream to simplify things.
Went through all places where an event was dropped (which destroys it) before the event is actually awaited on the stream. For subtree streams I just made them all synchronize since those streams need to be completed anyways. I did a small optimization to avoid another synchronize on the default stream (perhaps unnecessary) where after a D2H transfer, I remove the events that must have been observed on all streams.

Comparison to show there's no perf regression: https://github.com/axiom-crypto/openvm-reth-benchmark/actions/runs/19352734146

Copilot

Pull Request Overview

This PR fixes CUDA stream and event lifecycle issues to ensure events are properly observed before being destroyed. The main purpose is to prevent premature destruction of events and streams that could cause synchronization issues or resource corruption.

Key changes:

Removed the dedicated stream from MemoryMerkleTree and simplified synchronization to use the default stream
Split event tracking in MemoryMerkleSubTree into created_buffer_event and build_completion_event for clearer lifecycle management
Added explicit event cleanup in update_with_touched_blocks after D2H synchronization
Implemented Drop for MemoryMerkleTree to ensure proper cleanup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
crates/vm/src/system/cuda/merkle_tree/mod.rs	Refactored stream/event handling: removed dedicated tree stream, split subtree events into creation and completion events, improved synchronization in `finalize()` and `drop_subtrees()`, added `Drop` implementation, and optimized event cleanup after D2H transfers
crates/vm/src/system/cuda/memory.rs	Updated `Drop` implementation to use the refactored `drop_subtrees()` method with clearer documentation about synchronization requirements

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

crates/vm/src/system/cuda/merkle_tree/mod.rs

codspeed-hq · 2025-11-14T02:58:51Z

CodSpeed Performance Report

Merging #2242 will improve performances by ×12

_{Comparing fix/cuda-memory-stream (682e44c) with main (7e94889)¹}

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 24 improvements
⏩ 36 skipped²

Benchmarks breakdown

	Mode	Benchmark	`BASE`	`HEAD`	Change
⚡	WallTime	`benchmark_execute[bubblesort]`	27.4 ms	5.8 ms	×4.7
⚡	WallTime	`benchmark_execute[factorial_iterative_u256]`	108.7 ms	29.4 ms	×3.7
⚡	WallTime	`benchmark_execute[fibonacci_iterative]`	30.2 ms	2.5 ms	×12
⚡	WallTime	`benchmark_execute[fibonacci_recursive]`	44.9 ms	8.1 ms	×5.6
⚡	WallTime	`benchmark_execute[keccak256]`	32.8 ms	5.6 ms	×5.9
⚡	WallTime	`benchmark_execute[keccak256_iter]`	96.9 ms	52.4 ms	+85.02%
⚡	WallTime	`benchmark_execute[pairing]`	137.7 ms	112.1 ms	+22.91%
⚡	WallTime	`benchmark_execute[quicksort]`	31 ms	2.9 ms	×11
⚡	WallTime	`benchmark_execute[revm_snailtracer]`	16.7 ms	6.9 ms	×2.4
⚡	WallTime	`benchmark_execute[revm_transfer]`	42.8 ms	9.9 ms	×4.3
⚡	WallTime	`benchmark_execute[sha256]`	31.2 ms	4 ms	×7.8
⚡	WallTime	`benchmark_execute[sha256_iter]`	68.2 ms	28.2 ms	×2.4
⚡	WallTime	`benchmark_execute_metered[bubblesort]`	52.7 ms	11.3 ms	×4.7
⚡	WallTime	`benchmark_execute_metered[factorial_iterative_u256]`	230.5 ms	73.5 ms	×3.1
⚡	WallTime	`benchmark_execute_metered[fibonacci_iterative]`	74 ms	14.3 ms	×5.2
⚡	WallTime	`benchmark_execute_metered[fibonacci_recursive]`	102 ms	16.8 ms	×6.1
⚡	WallTime	`benchmark_execute_metered[keccak256]`	69.9 ms	11 ms	×6.3
⚡	WallTime	`benchmark_execute_metered[keccak256_iter]`	163.8 ms	72.7 ms	×2.3
⚡	WallTime	`benchmark_execute_metered[pairing]`	157.2 ms	117.9 ms	+33.38%
⚡	WallTime	`benchmark_execute_metered[quicksort]`	59.4 ms	9.1 ms	×6.5
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

No successful run was found on main (432dada) during the generation of this report, so 7e94889 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩
36 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

Golovanov399

I think doing everything on the default stream should work, but I wouldn't trust my judgement on this

crates/vm/src/system/cuda/merkle_tree/mod.rs

gaxiom

LGTM.
Mem Merkle Tree will wait for all subtrees to complete. I think that's fine

github-actions · 2025-11-17T18:53:30Z

group	app.proof_time_ms	app.cycles	app.cells_used	leaf.proof_time_ms	leaf.cycles	leaf.cells_used
verify_fibair	(+4 [+1.7%]) 237	322,610	2,058,654	-	-	-
fibonacci	(-14 [-1.4%]) 968	1,500,209	2,100,402	-	-	-
regex	(+32 [+1.4%]) 2,372	4,137,502	17,695,216	-	-	-
ecrecover	(-6 [-0.8%]) 706	122,859	2,265,100	-	-	-
pairing	(+25 [+1.8%]) 1,451	1,745,742	25,468,210	-	-	-

Commit: 682e44c

Benchmark Workflow

Copilot AI review requested due to automatic review settings November 14, 2025 02:45

Copilot started reviewing on behalf of jonathanpwang November 14, 2025 02:46 View session

jonathanpwang requested review from Golovanov399 and gaxiom November 14, 2025 02:47

Copilot finished reviewing on behalf of jonathanpwang November 14, 2025 02:49

Copilot AI reviewed Nov 14, 2025

View reviewed changes

crates/vm/src/system/cuda/merkle_tree/mod.rs Show resolved Hide resolved

crates/vm/src/system/cuda/merkle_tree/mod.rs Show resolved Hide resolved

This comment has been minimized.

Sign in to view

Golovanov399 approved these changes Nov 14, 2025

View reviewed changes

crates/vm/src/system/cuda/merkle_tree/mod.rs Show resolved Hide resolved

crates/vm/src/system/cuda/merkle_tree/mod.rs Show resolved Hide resolved

This comment has been minimized.

Sign in to view

jonathanpwang force-pushed the fix/cuda-memory-stream branch from aae868a to 8f23d8c Compare November 17, 2025 18:01

This comment has been minimized.

Sign in to view

jonathanpwang added 4 commits November 17, 2025 18:46

fix(cuda): use default stream for merkle tree finalize

f6ef5cd

fix(cuda): default stream sync before destroying events

d6f998e

fix: do not destroy created_buffer_event until sync

9709a94

fix: force a sync on drop subtree for safety

682e44c

nyunyunyunyu force-pushed the fix/cuda-memory-stream branch from 8f23d8c to 682e44c Compare November 17, 2025 18:46

gaxiom approved these changes Nov 17, 2025

View reviewed changes

jonathanpwang merged commit 8e48fbb into main Nov 17, 2025
70 of 71 checks passed

jonathanpwang deleted the fix/cuda-memory-stream branch November 17, 2025 22:13

jonathanpwang restored the fix/cuda-memory-stream branch November 17, 2025 22:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(cuda): ensure streams have observed events before destroying #2242

fix(cuda): ensure streams have observed events before destroying #2242

Uh oh!

jonathanpwang commented Nov 14, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

codspeed-hq bot commented Nov 14, 2025 •

edited

Loading

Uh oh!

Golovanov399 left a comment

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

gaxiom left a comment

Uh oh!

github-actions bot commented Nov 17, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix(cuda): ensure streams have observed events before destroying #2242

fix(cuda): ensure streams have observed events before destroying #2242

Uh oh!

Conversation

jonathanpwang commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

codspeed-hq bot commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #2242 will improve performances by ×12

Summary

Benchmarks breakdown

Footnotes

Uh oh!

Golovanov399 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

gaxiom left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 17, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jonathanpwang commented Nov 14, 2025 •

edited

Loading

codspeed-hq bot commented Nov 14, 2025 •

edited

Loading