Skip to content

Conversation

@jonathanpwang
Copy link
Contributor

@jonathanpwang jonathanpwang commented Nov 14, 2025

  • MemoryMerkleTree was using a very strange pattern with its own stream passing to/from default stream. I don't think this is necessary since most of the work is done on subtree streams and the finalize kernel can be on the default stream to simplify things.
  • Went through all places where an event was dropped (which destroys it) before the event is actually awaited on the stream. For subtree streams I just made them all synchronize since those streams need to be completed anyways. I did a small optimization to avoid another synchronize on the default stream (perhaps unnecessary) where after a D2H transfer, I remove the events that must have been observed on all streams.

Comparison to show there's no perf regression: https://github.com/axiom-crypto/openvm-reth-benchmark/actions/runs/19352734146

Copilot AI review requested due to automatic review settings November 14, 2025 02:45
Copilot finished reviewing on behalf of jonathanpwang November 14, 2025 02:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes CUDA stream and event lifecycle issues to ensure events are properly observed before being destroyed. The main purpose is to prevent premature destruction of events and streams that could cause synchronization issues or resource corruption.

Key changes:

  • Removed the dedicated stream from MemoryMerkleTree and simplified synchronization to use the default stream
  • Split event tracking in MemoryMerkleSubTree into created_buffer_event and build_completion_event for clearer lifecycle management
  • Added explicit event cleanup in update_with_touched_blocks after D2H synchronization
  • Implemented Drop for MemoryMerkleTree to ensure proper cleanup

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
crates/vm/src/system/cuda/merkle_tree/mod.rs Refactored stream/event handling: removed dedicated tree stream, split subtree events into creation and completion events, improved synchronization in finalize() and drop_subtrees(), added Drop implementation, and optimized event cleanup after D2H transfers
crates/vm/src/system/cuda/memory.rs Updated Drop implementation to use the refactored drop_subtrees() method with clearer documentation about synchronization requirements

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions

This comment has been minimized.

@codspeed-hq
Copy link

codspeed-hq bot commented Nov 14, 2025

CodSpeed Performance Report

Merging #2242 will improve performances by ×12

Comparing fix/cuda-memory-stream (682e44c) with main (7e94889)1

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 24 improvements
⏩ 36 skipped2

Benchmarks breakdown

Mode Benchmark BASE HEAD Change
WallTime benchmark_execute[bubblesort] 27.4 ms 5.8 ms ×4.7
WallTime benchmark_execute[factorial_iterative_u256] 108.7 ms 29.4 ms ×3.7
WallTime benchmark_execute[fibonacci_iterative] 30.2 ms 2.5 ms ×12
WallTime benchmark_execute[fibonacci_recursive] 44.9 ms 8.1 ms ×5.6
WallTime benchmark_execute[keccak256] 32.8 ms 5.6 ms ×5.9
WallTime benchmark_execute[keccak256_iter] 96.9 ms 52.4 ms +85.02%
WallTime benchmark_execute[pairing] 137.7 ms 112.1 ms +22.91%
WallTime benchmark_execute[quicksort] 31 ms 2.9 ms ×11
WallTime benchmark_execute[revm_snailtracer] 16.7 ms 6.9 ms ×2.4
WallTime benchmark_execute[revm_transfer] 42.8 ms 9.9 ms ×4.3
WallTime benchmark_execute[sha256] 31.2 ms 4 ms ×7.8
WallTime benchmark_execute[sha256_iter] 68.2 ms 28.2 ms ×2.4
WallTime benchmark_execute_metered[bubblesort] 52.7 ms 11.3 ms ×4.7
WallTime benchmark_execute_metered[factorial_iterative_u256] 230.5 ms 73.5 ms ×3.1
WallTime benchmark_execute_metered[fibonacci_iterative] 74 ms 14.3 ms ×5.2
WallTime benchmark_execute_metered[fibonacci_recursive] 102 ms 16.8 ms ×6.1
WallTime benchmark_execute_metered[keccak256] 69.9 ms 11 ms ×6.3
WallTime benchmark_execute_metered[keccak256_iter] 163.8 ms 72.7 ms ×2.3
WallTime benchmark_execute_metered[pairing] 157.2 ms 117.9 ms +33.38%
WallTime benchmark_execute_metered[quicksort] 59.4 ms 9.1 ms ×6.5
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  1. No successful run was found on main (432dada) during the generation of this report, so 7e94889 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 36 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Copy link
Contributor

@Golovanov399 Golovanov399 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing everything on the default stream should work, but I wouldn't trust my judgement on this

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@nyunyunyunyu nyunyunyunyu force-pushed the fix/cuda-memory-stream branch from 8f23d8c to 682e44c Compare November 17, 2025 18:46
Copy link
Contributor

@gaxiom gaxiom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Mem Merkle Tree will wait for all subtrees to complete. I think that's fine

@github-actions
Copy link

group app.proof_time_ms app.cycles app.cells_used leaf.proof_time_ms leaf.cycles leaf.cells_used
verify_fibair (+4 [+1.7%]) 237 322,610 2,058,654 - - -
fibonacci (-14 [-1.4%]) 968 1,500,209 2,100,402 - - -
regex (+32 [+1.4%]) 2,372 4,137,502 17,695,216 - - -
ecrecover (-6 [-0.8%]) 706 122,859 2,265,100 - - -
pairing (+25 [+1.8%]) 1,451 1,745,742 25,468,210 - - -

Commit: 682e44c

Benchmark Workflow

@jonathanpwang jonathanpwang merged commit 8e48fbb into main Nov 17, 2025
70 of 71 checks passed
@jonathanpwang jonathanpwang deleted the fix/cuda-memory-stream branch November 17, 2025 22:13
@jonathanpwang jonathanpwang restored the fix/cuda-memory-stream branch November 17, 2025 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants