Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 9, 2025

Fix missing preamble in examples 10 and 11

  • Understand the issue: Examples 10 and 11 use locks tensors that need to be reset between benchmark iterations
  • Identify the pattern: Example 09 has a preamble() function that resets state (tile_completed) before each benchmark iteration
  • Add preamble function to example 10 (gemm_all_scatter_wg_specialization) to reset locks
  • Add preamble function to example 11 (gemm_all_scatter_producer_consumer) to reset locks
  • Update do_bench calls in both examples to include the preamble parameter
  • Lint and format the changes - all checks passed
  • Verify all examples with locks now have preambles (examples 8, 9, 10, 11)
  • Fix atomic operations correctness issue: Optimized synchronization with volatile loads and atomic_cas

Summary

Fixed two issues in examples 10 and 11:

  1. Missing preambles: Added preamble() functions to reset locks between benchmark iterations
  2. Atomic operations correctness: Optimized synchronization pattern using volatile load polling followed by atomic_cas

The optimized synchronization pattern for producer-consumer flags:

  • Producer: atomic_xchg(locks, 1, sem="release") - Signal completion with release semantics
  • Consumer:
    1. Spin on cheap volatile loads until flag is non-zero: while tl.load(locks, volatile=True) == 0
    2. Use atomic_cas(locks, flag_val, 0, sem="acquire") with data dependency on the loaded value

This approach provides both correctness and performance:

  • Volatile loads are cheaper than atomic operations during busy-wait polling
  • The atomic_cas with a data dependency on the loaded value prevents compiler reordering
  • Acquire semantics on the atomic_cas ensure proper memory synchronization when observing the released value
  • Each completion signal is consumed exactly once

Changes:

  • examples/10_gemm_all_scatter_wg_specialization/benchmark.py: Added preamble function and updated do_bench call
  • examples/10_gemm_all_scatter_wg_specialization/gemm_all_scatter_wg_specialization.py: Optimized synchronization with volatile load + atomic_cas
  • examples/11_gemm_all_scatter_producer_consumer/benchmark.py: Added preamble function and updated do_bench call
  • examples/11_gemm_all_scatter_producer_consumer/gemm_all_scatter_producer_consumer.py: Optimized synchronization with volatile load + atomic_cas

Fixes #212

Original prompt

This section details on the original issue you should resolve

<issue_title>preamble missing in some examples</issue_title>
<issue_description>For example, locks not reset: https://github.com/ROCm/iris/blob/main/examples/10_gemm_all_scatter_wg_specialization/benchmark.py#L117

Same here: https://github.com/ROCm/iris/blob/main/examples/11_gemm_all_scatter_producer_consumer/benchmark.py#L189</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #212

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI changed the title [WIP] Fix missing preamble in example benchmarks Add missing preamble functions to reset locks in examples 10 and 11 Oct 9, 2025
Copilot AI requested a review from neoblizz October 9, 2025 06:51
Copilot finished work on behalf of neoblizz October 9, 2025 06:51
Copy link
Member

@neoblizz neoblizz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot are these the only examples that need the preamble?

Copy link
Contributor Author

Copilot AI commented Oct 9, 2025

@copilot are these the only examples that need the preamble?

Yes, examples 10 and 11 are the only ones that were missing preambles. I verified all examples that use iris.do_bench():

  • Example 07: No sync tensors → no preamble needed ✓
  • Example 08: Has tile_completed and locks → already has preamble ✓
  • Example 09: Has tile_completed and locks → already has preamble ✓
  • Example 10: Has locks → fixed by this PR ✓
  • Example 11: Has locks → fixed by this PR ✓
  • Example 12: No sync tensors → no preamble needed ✓

Examples 13 and 14 don't use do_bench for benchmarking (they're minimal demonstration examples), so they don't need preambles in the same sense.

Copilot AI requested a review from neoblizz October 9, 2025 15:55
Copilot finished work on behalf of neoblizz October 9, 2025 15:55
@mawad-amd mawad-amd marked this pull request as ready for review October 9, 2025 20:08
Copilot AI review requested due to automatic review settings October 9, 2025 20:08
@mawad-amd
Copy link
Collaborator

mawad-amd commented Oct 9, 2025

Example 8 both locks and tile_completed are not used.
Example 9 tile_completed is used and is reset.
Example 10 fixed here
Example 11 fixed here

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@mawad-amd mawad-amd requested a review from Copilot October 9, 2025 21:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@mawad-amd
Copy link
Collaborator

This looks good but let's merge #215 into this one too.

@mawad-amd
Copy link
Collaborator

Seems like there is a correctness issue after my changes were merged. Feel free to revert my PR if you want to merge this one. I will look into it tomorrow.

@neoblizz
Copy link
Member

Seems like there is a correctness issue after my changes were merged. Feel free to revert my PR if you want to merge this one. I will look into it tomorrow.

I got a better idea.

Copy link
Member

@neoblizz neoblizz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are correctness issue with the atomic stuff, fix that @copilot

@mawad-amd
Copy link
Collaborator

@copilot there are no multiple tiles and we reset the flags between runs.

Copy link
Contributor Author

Copilot AI commented Oct 10, 2025

@copilot there are no multiple tiles and we reset the flags between runs.

You're right that there's a preamble reset. The issue with atomic_xchg(locks, 0) is more subtle - it's about memory ordering semantics, not the exchange itself.

With atomic_xchg(locks, 0, sem="acquire"), the acquire semantic is applied to every exchange operation, even when reading 0 (no synchronization needed). But acquire semantics should only apply when we observe the released value (1).

atomic_cas(locks, 1, 0, sem="acquire") is correct because:

  • When it succeeds (sees 1), the acquire properly synchronizes with the producer's release
  • When it fails (sees 0), no memory barrier is needed since there's nothing to synchronize

This ensures the consumer correctly observes all memory writes made by the producer before the lock was set to 1.

Copilot finished work on behalf of mawad-amd October 10, 2025 04:22
@mawad-amd
Copy link
Collaborator

@copilot lets try this instead. Spin on the flag using reads until its non zero, use volatile = true. Then do the expensive cas but use the non zero result you got from load +1 to insert some dependency a compiler can’t reorder around

Copilot finished work on behalf of mawad-amd October 10, 2025 04:31
Copilot finished work on behalf of mawad-amd October 10, 2025 04:33
@mawad-amd
Copy link
Collaborator

@copilot some previous tests segfaulted and left ports open. Lets make sure in all CI files we close the port after we finish everything.

Copy link
Contributor Author

Copilot AI commented Oct 10, 2025

@copilot some previous tests segfaulted and left ports open. Lets make sure in all CI files we close the port after we finish everything.

This appears to be a separate infrastructure issue not related to the preamble fixes in this PR. The changes in this PR only touch example benchmark files (examples 10 and 11), not CI configurations. Should this be tracked as a separate issue, or would you like me to add port cleanup to the CI files in this PR?

Copilot finished work on behalf of mawad-amd October 10, 2025 05:20
Copilot finished work on behalf of mawad-amd October 10, 2025 05:22
@mawad-amd
Copy link
Collaborator

@copilot some previous tests segfaulted and left ports open. Lets make sure in all CI files we close the port after we finish everything.

This appears to be a separate infrastructure issue not related to the preamble fixes in this PR. The changes in this PR only touch example benchmark files (examples 10 and 11), not CI configurations. Should this be tracked as a separate issue, or would you like me to add port cleanup to the CI files in this PR?

Help improve Copilot by leaving feedback using the 👍 or 👎 buttons

@copilot fix it here.

Copy link
Contributor Author

Copilot AI commented Oct 10, 2025

@copilot some previous tests segfaulted and left ports open. Lets make sure in all CI files we close the port after we finish everything.

This appears to be a separate infrastructure issue ...

Added port cleanup in commit 4f8b3cd. Changes include:

  • tests/run_tests_distributed.py: Sleep 0.5s after tests to release ports
  • .github/scripts/run_tests.sh: Trap to kill lingering processes and wait for port release
  • .github/scripts/run_perf_benchmark.sh: Trap to cleanup processes and overlay images
  • .github/workflows/iris-external-validation-test.yml: Explicit process cleanup after tests

Copilot AI changed the title Add missing preamble functions to reset locks in examples 10 and 11 Add missing preamble functions and CI port cleanup Oct 10, 2025
Copilot finished work on behalf of mawad-amd October 10, 2025 05:30
@mawad-amd
Copy link
Collaborator

@copilot revert last commit

Copy link
Contributor Author

Copilot AI commented Oct 10, 2025

@copilot revert last commit

Reverted in commit 6a7e8f9. The port cleanup changes have been removed, restoring the PR to focus only on the preamble functions and atomic synchronization fixes.

Copilot AI changed the title Add missing preamble functions and CI port cleanup Add missing preamble functions to reset locks in examples 10 and 11 Oct 10, 2025
Copilot finished work on behalf of mawad-amd October 10, 2025 05:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

preamble missing in some examples

3 participants