Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make upload command buffer operations thread-safe #223

Merged
merged 3 commits into from
Dec 7, 2024

Conversation

cryy22
Copy link
Contributor

@cryy22 cryy22 commented Dec 6, 2024

this PR resolves a compatibility issue where projects that use threading for GPU copy operations would sometimes crash due to thread contention when using the SDLGPU driver. the contended resources were the upload and render command buffers and associated copy/render passes, and corresponding metadata.

this PR resolves the issue with the following approach:

  • gate copy pass access and associated metadata behind a corresponding copy pass mutex
    • adding upload commands to the copy pass locks the mutex until the command is added and metadata is updated
    • mutex is locked when the upload command buffer is submitted; new command buffer is acquired and copy pass begun before mutex is unlocked
  • in copy pass-exclusive methods (e.g. SetTextureData), when a command flush is required, only flush the upload command buffer so as to avoid interfering with the render command buffer.
  • always flush the upload command buffer before any render command buffer submissions (and lock the mutex)
  • no mutex for render command buffer access; assume that all render ops are scheduled and run by the main thread only

to test this implementation, i ran the following ruby script:

n_test.rb
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'open3'

n = ARGV[0].to_i
puts "running startup #{n} times"

env = { 'FNA_PLATFORM_BACKEND' => 'SDL3', 'FNA3D_FORCE_DRIVER' => 'SDLGPU' }

n.times do |index|
  puts "attempt #{index + 1}"
  Open3.popen3(env, './Celeste') do |_, output, error, t|
    while (line = output.gets)
      next unless line.include?('DONE LOADING')

      puts('success')
      Process.kill('TERM', t.pid)
    end

    unless t.value.success?
      puts "failed to load assets on attempt #{index + 1}"
      puts error.read
      Process.kill('KILL', t.pid)
      exit 1
    end
  end
end

the script opens Celeste and waits for it to print DONE LOADING on stdout, kills the program and does it again, n times. if the program encounters an error it prints stderr and exits.

initially, i ran this against an FNA3D build from the current master branch. Celeste crashed on startup roughly 10% of the time with expected errors (e.g. "Command buffer already submitted!", "Cannot acquire a swapchain texture during a pass!"). with an FNA3D build from this PR, i ran this script with n=500 without any failures.

submitting as a draft because i would like to see a successful test against another FNA game known to use threaded uploads at startup. i am working on testing TMNT but it requires a new dev environment (no mac builds), so assistance would be appreciated. please note that the script above is hard-coded to work with Celeste only as it depends on the specific DONE LOADING stdout line that i am pretty sure originates from the Celeste codebase.

@kg
Copy link
Contributor

kg commented Dec 6, 2024

I have threaded uploads (disabled because they currently crash on SDL_GPU) so i'll try and test with this ASAP.

@flibitijibibo
Copy link
Member

Will test with TMNT and SOR4 soon!

@flibitijibibo
Copy link
Member

Looks like the CI failures are just the SDL3 beta tag being old - we should ask Sam to tag another preview release.

In the meantime, the Linux SDL3 test branches on Steam for both TMNT and SOR4 have been updated with this draft.

@flibitijibibo
Copy link
Member

Looks like we're just barely off; changing this to 3.1.6 should fix it?

https://github.com/FNA-XNA/FNA3D/blob/master/.github/workflows/ci.yml#L7

@flibitijibibo
Copy link
Member

Man, SDL3 releases are a mess right now... I'll take care of this on my end, sorry!!

@flibitijibibo
Copy link
Member

Latest upstream fixes CI. I can also take the .gitignore changes as a separate patch while we review the GPU changes.

@thatcosmonaut thatcosmonaut linked an issue Dec 6, 2024 that may be closed by this pull request
@cryy22 cryy22 force-pushed the cmd-buf-threadsafe branch from b533923 to 70c2bed Compare December 6, 2024 19:55
@cryy22
Copy link
Contributor Author

cryy22 commented Dec 6, 2024

rebased on upstream/master. gitignore patch!

diff --git a/.gitignore b/.gitignore
index 525173d..6f598aa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,8 +7,10 @@ obj/
 Debug/
 Release/
 *.vcxproj.user
+.vscode/
 *.suo
 .DS_Store
 xcuserdata/
 *.xcworkspace/
-*.xcscheme
\ No newline at end of file
+*.xcscheme
+build/

Copy link
Member

@flibitijibibo flibitijibibo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes overall lgtm - basically all I can think of is rearranging stuff to shrink the size of the diff, but I'll happily take a larger diff if we get better organization out of it.

@thatcosmonaut thatcosmonaut marked this pull request as ready for review December 6, 2024 20:03
@cryy22
Copy link
Contributor Author

cryy22 commented Dec 6, 2024

@flibitijibibo sweet! the main organizational change was just moving BeginCopyPass, EndCopyPass, BeginRenderPass, and EndRenderPass into the same general location. still ended up a little messy due to their dependencies on other functions and each other, so i'm perfectly happy to put em back in the interest of continuity if you prefer, let me know!

Copy link
Member

@flibitijibibo flibitijibibo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran through the changes in more detail and the new arrangement is definitely better, just one nit and I'll go ahead and merge so fnalibs-dailies can have this for QA!

@flibitijibibo flibitijibibo merged commit 1cdafc1 into FNA-XNA:master Dec 7, 2024
14 checks passed
@cryy22 cryy22 deleted the cmd-buf-threadsafe branch January 10, 2025 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Threading support for SDL_GPU
4 participants