Memoize col in conv_im2col functions to reduce allocations #635

Open
wants to merge 2 commits into base: master

Conversation


@Octogonapus commented Mar 12, 2025

This PR adds a memo to the conv_im2col functions which is used to memoize col (the scratch space those functions use).
This dramatically reduces allocations for convolutional layers and increases throughput.
There is a new public function, free_scratchspace_memo!, which clears that memo, should end users wish to release the memory once they are done using conv.
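A minimal sketch of the memoization scheme, for illustration (the names SCRATCHSPACE_MEMO and get_scratchspace are hypothetical, not the PR's actual implementation; only free_scratchspace_memo! comes from this PR):

# Illustrative sketch: a global memo keyed by element type and buffer size.
const SCRATCHSPACE_MEMO = Dict{Tuple{DataType,Dims},Array}()
const SCRATCHSPACE_LOCK = ReentrantLock()

# Return a cached buffer for (T, dims), allocating only on first use.
function get_scratchspace(::Type{T}, dims::Dims{N}) where {T,N}
    lock(SCRATCHSPACE_LOCK) do
        get!(SCRATCHSPACE_MEMO, (T, dims)) do
            Array{T}(undef, dims)
        end::Array{T,N}
    end
end

# Free the cached buffers by emptying the memo.
function free_scratchspace_memo!()
    lock(SCRATCHSPACE_LOCK) do
        empty!(SCRATCHSPACE_MEMO)
    end
    return nothing
end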

Before (using ObjectDetector.jl as a benchmark):

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.908225 seconds (52.34 k allocations: 51.040 GiB, 18.35% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.986337 seconds (52.34 k allocations: 51.040 GiB, 17.18% gc time)

After:

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.943823 seconds (53.09 k allocations: 14.850 GiB, 9.25% gc time) # the first call with an empty memo will allocate to fill the memo

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.800984 seconds (53.00 k allocations: 1.101 GiB, 3.18% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.732375 seconds (53.00 k allocations: 1.101 GiB, 0.32% gc time)

We saved 50 GiB of allocations!

An alternative design is to plumb the col parameter through the many intermediate generated functions, which seems quite complex; a sketch of what that would entail follows.
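For contrast, a toy sketch of the plumbing alternative (stand-in functions, not NNlib's real call chain): the buffer must be accepted and forwarded at every level between the public entry point and the kernel.

conv_entry(x; col) = conv_middle(x; col)    # public entry point
conv_middle(x; col) = conv_kernel(x; col)   # one of many intermediates
conv_kernel(x; col) = (col .= x; sum(col))  # stand-in for the im2col kernel

col = zeros(Float32, 4)
conv_entry(ones(Float32, 4); col)  # the caller owns and reuses the buffer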

PR Checklist

  • Tests are added
  • Documentation, if applicable

@Octogonapus (Author)

The Lux.jl test failures look unrelated to this PR. The Flux.jl test failures also look unrelated, but they are confusing to me, as I am not very familiar with that project.

@Octogonapus (Author)

Eric Hanson noted that this might be problematic for GPU workloads, which often run out of memory. That's a good point, so maybe this could be configured via scoped values or a global configuration mechanism. I would be interested in what the maintainers think before making changes.
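A hypothetical sketch of the scoped-value idea, assuming Julia 1.11's Base.ScopedValues and the get_scratchspace helper sketched above (the flag name is made up):

using Base.ScopedValues

const USE_SCRATCHSPACE_MEMO = ScopedValue(true)

# Fall back to a fresh allocation when the memo is disabled.
function maybe_memoized_scratchspace(::Type{T}, dims::Dims{N}) where {T,N}
    USE_SCRATCHSPACE_MEMO[] ? get_scratchspace(T, dims) : Array{T}(undef, dims)
end

# GPU-heavy code could opt out locally:
# with(USE_SCRATCHSPACE_MEMO => false) do
#     conv(x, w)
# end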

@ToucheSir (Member)

Rephrasing my thoughts from the Slack thread,

> That's a good point, so maybe this could be configured via scoped values

this sounds like a cool idea. Perhaps we could let users plug in their own allocation function or allocator that follows a defined interface?
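One hypothetical shape for such an interface (purely illustrative, reusing the get_scratchspace sketch above):

abstract type ScratchAllocator end

struct DefaultAllocator <: ScratchAllocator end   # always allocates fresh
alloc_scratch(::DefaultAllocator, ::Type{T}, dims) where {T} = Array{T}(undef, dims)

struct MemoAllocator <: ScratchAllocator end      # reuses cached buffers
alloc_scratch(::MemoAllocator, ::Type{T}, dims) where {T} = get_scratchspace(T, dims)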

@Octogonapus (Author)

Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into.

@pxl-th (Member) commented Mar 15, 2025

> Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into.

To implement it for an array type, you basically have to use the GPUArrays.cached_alloc function, where the first argument is the function that does the actual allocation (if no cached buffer exists) and the second is the tuple/key that will be used to look up the cache for that allocation.
You can take a look at how AMDGPU.jl does it:
https://github.com/JuliaGPU/AMDGPU.jl/blob/822da911e34aa319808bf93d61178c45e759df75/src/array.jl#L9
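A rough sketch of that usage, assuming the signature described above (the key contents are illustrative, and the allocation is a CPU stand-in):

using GPUArrays

function cached_col(::Type{T}, dims) where {T}
    # The do-block runs only when no cached buffer exists for this key.
    GPUArrays.cached_alloc((:conv_im2col_col, T, dims)) do
        Array{T}(undef, dims)  # a GPU backend would allocate its array type here
    end
end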
