[AUDIO_WORKLET] Optimised output buffer copy #24891

Open
wants to merge 24 commits into
base: main
from

cwoffenden
Contributor

@cwoffenden cwoffenden commented Aug 8, 2025

A reworking of #22753, which "improves the copy back from the audio worklet's heap to JS by 7-12x depending on the browser." From the previous description:

Since we pass in the stack for the worklet from the caller's heap, its address doesn't change. And since the render quantum size doesn't change after the audio worklet creation, the stack positions for the audio buffers do not change either. This optimisation adds one-time subarray views and replaces the float-by-float copy with a simple set() per channel (per output).
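As a rough illustration of the idea (not the PR's actual code), the sketch below compares a float-by-float copy with one-time subarray views and a set() per channel. The heap stand-in, the offset and the function names are all assumptions made for the example.

```javascript
// Stand-in for the wasm heap (Emscripten exposes this as HEAPF32); 1024 floats here.
const heapF32 = new Float32Array(1024);
const samplesPerChannel = 128; // render quantum size, fixed after creation
const numChannels = 2;
const stackFloatOffset = 256;  // hypothetical fixed position of the output data, in floats

// One-time setup: one view per channel over the fixed heap region.
const outputViews = [];
for (let ch = 0; ch < numChannels; ch++) {
  const start = stackFloatOffset + ch * samplesPerChannel;
  outputViews.push(heapF32.subarray(start, start + samplesPerChannel));
}

// Float-by-float copy: one indexed read and write per sample.
function copyFloatByFloat(dst, ch) {
  let src = stackFloatOffset + ch * samplesPerChannel;
  for (let i = 0; i < samplesPerChannel; i++) {
    dst[i] = heapF32[src++];
  }
}

// View-based copy: a single set() per channel using the precomputed view.
function copyWithSet(dst, ch) {
  dst.set(outputViews[ch]);
}
```

Both functions produce the same result; the second simply hands the whole copy to the engine in one call.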

The existing interactive tests (written for the original PR) can be run for comparison:

test/runner interactive.test_audio_worklet_stereo_io
test/runner interactive.test_audio_worklet_2x_stereo_io
test/runner interactive.test_audio_worklet_mono_io
test/runner interactive.test_audio_worklet_2x_hard_pan_io
test/runner interactive.test_audio_worklet_params_mixing
test/runner interactive.test_audio_worklet_memory_growth
test/runner interactive.test_audio_worklet_hard_pans

These test various input/output arrangements as well as parameters (parameters are interesting because, depending on the browser, the sizes change as the params move from static to varying).

The original benchmark of the extracted copy is still valid:

https://wip.numfum.com/cw/2024-10-29/index.html

This is tested with 32- and 64-bit wasm (which required a reordering of how structs and data were stored to avoid alignment issues).

Some explanations:

  • Fixed-position output buffer views are created once in the WasmAudioWorkletProcessor constructor
  • Stack allocations for the process() call are split into aligned struct data (see the comments) and audio/param data
  • The struct writes are simplified by this splitting of data
  • ASSERTIONS are used to ensure everything fits and correctly aligns
  • The tests account for size changes in the params, which can vary from a single float to 128 floats (a single float nicely showing up any 8-byte alignment issues for wasm64)
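
To illustrate why a single-float parameter exposes 8-byte alignment issues, here is a minimal sketch of a layout calculation. It assumes (hypothetically) that the float data is placed first and the struct data follows at the next 8-byte boundary; the function names and the ordering are illustrative and not taken from the PR.

```javascript
const FLOAT_BYTES = 4;

// Rounds bytes up to the next multiple of alignment (a power of two).
const alignUp = (bytes, alignment) => (bytes + alignment - 1) & ~(alignment - 1);

// Hypothetical layout: audio and param floats first, then struct data
// starting at an 8-byte boundary (as wasm64 pointers would require).
function layoutStack(paramLengths, channelCount, samplesPerChannel) {
  let dataBytes = channelCount * samplesPerChannel * FLOAT_BYTES;
  for (const len of paramLengths) {
    dataBytes += len * FLOAT_BYTES;
  }
  return { dataBytes, structOffset: alignUp(dataBytes, 8) };
}
```

With two 128-sample channels, a 128-float param leaves the data size a multiple of 8, whereas a single-float param does not, so padding is needed before any 8-byte-aligned struct.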

Future improvements: the output views are sequential, so instead of individual views covering each channel, a view could cover anything from one to however many channels are needed, with a single set() being enough for all outputs.

@cwoffenden force-pushed the cw-aw-optimised-copy branch from 5a70474 to 99f4c12, August 8, 2025 18:23
@cwoffenden marked this pull request as draft, August 8, 2025 18:23
@cwoffenden force-pushed the cw-aw-optimised-copy branch 2 times, most recently from e19d4fd to 348932f, August 12, 2025 16:17
@cwoffenden marked this pull request as ready for review, August 12, 2025 19:54
@cwoffenden force-pushed the cw-aw-optimised-copy branch from 2313240 to 748c167, August 13, 2025 08:30
@sbc100 requested a review from juj, August 13, 2025 15:16
@cwoffenden marked this pull request as draft, August 13, 2025 17:31
@cwoffenden force-pushed the cw-aw-optimised-copy branch 2 times, most recently from 4cba21c to 998ac6b, August 14, 2025 14:45
@cwoffenden force-pushed the cw-aw-optimised-copy branch from b57d103 to 5186686, August 14, 2025 18:37
@cwoffenden marked this pull request as ready for review, August 15, 2025 10:50
@cwoffenden
Contributor Author

cwoffenden commented Aug 15, 2025

It's not possible to merge the output data copy into a single set() (nor the input data), since the receiving float arrays are independent.
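
A small sketch of why one set() cannot cover every channel: the destination channels are modelled here as separate Float32Array objects with their own buffers (an assumption standing in for what the browser passes to process()), so each needs its own copy even though the source region is contiguous.

```javascript
const samplesPerChannel = 128;

// Stand-in for the contiguous heap region holding two sequential channels.
const heapRegion = new Float32Array(2 * samplesPerChannel);

// Stand-ins for the browser's output channels: independent arrays,
// each backed by its own buffer.
const outputChannels = [
  new Float32Array(samplesPerChannel),
  new Float32Array(samplesPerChannel),
];

// One set() per destination channel, each from its slice of the region.
function copyOutputs(region, channels) {
  for (let ch = 0; ch < channels.length; ch++) {
    const start = ch * samplesPerChannel;
    channels[ch].set(region.subarray(start, start + samplesPerChannel));
  }
}
```

Passing the whole two-channel region to a single channel's set() would throw a RangeError, because the source is longer than the destination.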

sbc100 pushed a commit that referenced this pull request Aug 19, 2025
This adds an interactive test to force growing the heap during playback:
```
test/runner interactive.test_audio_worklet_memory_growth
```
Tested with `interactive64` and `interactive_2gb` (for
`interactive64_4gb` the heap is already at the browser's max in testing
so can't be grown).

It works by alloc'ing and leaking 2/3 of the current size until it can
no longer do so. Emscripten regrows its wasm memory in the process,
invalidating any data views (see #24891).

**Edit: test can now grow from both the main and audio thread.**
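
The invalidation mentioned above can be sketched with a plain WebAssembly.Memory. This example uses a non-shared memory, where growing detaches the old buffer; a shared memory (as used with threads) may behave differently, so treat it as an illustration of the general problem rather than of Emscripten's exact behaviour. The ensureView() helper is hypothetical.

```javascript
// One 64 KiB page to start with, growable up to four pages.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 4 });

// A cached view over the first 128 floats of the memory.
let view = new Float32Array(memory.buffer, 0, 128);
view[0] = 0.5;
const staleView = view;

// Growing a non-shared memory detaches the old ArrayBuffer,
// so views created before the grow become empty.
memory.grow(1);

// Hypothetical helper: recreate the view if its buffer is no longer current.
function ensureView() {
  if (view.buffer !== memory.buffer) {
    view = new Float32Array(memory.buffer, 0, 128);
  }
  return view;
}
```

The contents survive the grow, so a recreated view sees the same data as before.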
sbc100 pushed a commit that referenced this pull request Aug 19, 2025
Built on #24931 (it touches the same file, but a rebase after merge will
fix this). This adds hard-panned audio files to test that the left and
right channels don't get flipped with any changes to the audio worklet
code (relevant for #24891, which changes how the copies are performed).
```
test/runner interactive.test_audio_worklet_hard_pans
```
The bass track is hard-left (with its right muted), drums are right.
@cwoffenden requested a review from sbc100, August 19, 2025 19:31
@cwoffenden
Contributor Author

cwoffenden commented Aug 19, 2025

This is done, though I can rebase with all the tests merged so that they are available.

@cwoffenden requested a review from juj, August 20, 2025 21:06
var outputViewsNeeded = 0;
for (entry of outputList) {
  outputViewsNeeded += entry.length;
}
Collaborator

@juj juj Aug 20, 2025

The above loop multiplies by this.bytesPerChannel outside the loop, whereas the loop before it multiplies inside the loop. A small micro-optimisation would be to merge the two and multiply once outside (though it looks good to me either way).
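
A minimal sketch of the suggested micro-optimisation, assuming the two loops only count channels: sum the channel counts first and multiply by the per-channel size once. The function name and parameters are hypothetical.

```javascript
// Counts every input and output channel, then multiplies once at the end
// rather than multiplying inside either loop.
function stackBytesNeeded(inputList, outputList, bytesPerChannel) {
  let channels = 0;
  for (const entry of inputList) {
    channels += entry.length;
  }
  for (const entry of outputList) {
    channels += entry.length;
  }
  return channels * bytesPerChannel;
}
```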

Contributor Author

I'll take a look over this. The most compact would probably be to take the incoming sizes (but I work on the assumption that the browser will do something wrong at some point in the future and throw the maths off).

Contributor Author

I took a look and reused the stackMemoryData var (committed). I'm not sure it improves the readability, whereas for outputViewsNeeded it made sense. I'm not sure there are enough channels to warrant it.

this.bytesPerChannel = this.samplesPerChannel * {{{ getNativeTypeSize('float') }}};

// Creates the output views (see createOutputViews() docs)
this.maxBuffers = Math.min(((wwParams.stackSize - /*minimum alloc*/ 16) / this.bytesPerChannel) | 0, /*sensible limit*/ 64);
Collaborator

It looks like we don't ever use this.maxBuffers beyond the call to this.createOutputViews() below?

One way to recoup some of the code size increase would be to avoid storing this.xxx members that aren't needed outside this scope.

Collaborator

Oh I see, we need to know the number of buffers when we call this.createOutputViews() again.

Maybe then this could do

this.outputViews = new Array(Math.min(((wwParams.stackSize - /*minimum alloc*/ 16) / this.bytesPerChannel) | 0, /*sensible limit*/ 64));

and instead of

this.outputViews.unshift(

inside createOutputViews(), the function would always assign to this.outputViews[i], reusing the array with its fixed, pre-initialized length?
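
The suggestion might look roughly like the sketch below. It assumes the views are laid out downwards from the stack top and that the array is filled from its last index to keep the order repeated unshift() calls would give; both points, and the standalone function signatures, are assumptions rather than details confirmed by the PR.

```javascript
const FLOAT_BYTES = 4;

// Same limit as in the diff: as many buffers as fit in the stack, capped at 64.
function maxBuffersFor(stackSize, bytesPerChannel) {
  return Math.min(((stackSize - /*minimum alloc*/ 16) / bytesPerChannel) | 0, /*sensible limit*/ 64);
}

// Fills a pre-sized array in place instead of growing it with unshift().
// stackTop is a byte address; views are created downwards from it.
function createOutputViews(heapF32, stackTop, outputViews, samplesPerChannel) {
  const bytesPerChannel = samplesPerChannel * FLOAT_BYTES;
  let ptr = stackTop;
  for (let i = outputViews.length - 1; i >= 0; i--) {
    ptr -= bytesPerChannel;
    const start = ptr / FLOAT_BYTES;
    outputViews[i] = heapF32.subarray(start, start + samplesPerChannel);
  }
}
```

Because the array length is fixed up front, the same function can be called again after the heap is replaced, and assertions can check against outputViews.length rather than a separate member.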

Contributor Author

I think this has merit and will result in a saving. Assertions can be against the array length.

I'll try to take a look today.

Contributor Author

I factored out tracking the length. Update incoming.

Collaborator

@juj juj left a comment

LGTM, the logic looks solid now.

I had a couple of minor comments, but those are cosmetic. LGTM either way, this looks good to land to me.

@cwoffenden
Contributor Author

LGTM, the logic looks solid now.

Thanks!

@juj
Collaborator

juj commented Aug 21, 2025

One thing that comes to mind that may be worth probing: the TLS section is allocated at the end of the stack (it should be exclusive of the stack size, though).

So you could try placing a global variable, __thread int foo = 42;, in the test test/webaudio/audioworklet_params_mixing.c, and adding an assert(foo == 42); to verify that its value never changes during the test. This assert would go inside the audio worklet's process callback.

This will verify that the allocation from stack top won't stomp on TLS variables.

@cwoffenden
Contributor Author

One thing that comes to mind that may be worth probing

[snip]

I can add this as a separate PR. Internals like this were one of my worries when starting this last year (most of those worries went away once it was working, but I'm still missing insights like this).

@juj
Collaborator

juj commented Aug 21, 2025

Yeah that sounds good.

@cwoffenden
Contributor Author

cwoffenden commented Aug 21, 2025

Changes in as per @juj's suggestions. CI is breaking but it's not related.

@cwoffenden
Contributor Author

Yeah that sounds good.

#25024 tests this.

3 participants