Memoize col in conv_im2col functions to reduce allocations #635

Open
wants to merge 2 commits into base: master

Conversation


@Octogonapus commented Mar 12, 2025

This PR adds a memo to the conv_im2col functions which is used to memoize col (the scratch space those functions use).
This dramatically reduces allocations for convolutional layers and increases throughput.
There is a new public function, free_scratchspace_memo!, which clears that memo, should end users wish to release the memory once they are done using conv.
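A minimal sketch of the memoization scheme, for illustration (the names SCRATCHSPACE_MEMO and get_scratchspace are hypothetical, not the PR's actual implementation; only free_scratchspace_memo! comes from this PR):

# Illustrative sketch: a global memo keyed by element type and buffer size.
const SCRATCHSPACE_MEMO = Dict{Tuple{DataType,Dims},Array}()
const SCRATCHSPACE_LOCK = ReentrantLock()

# Return a cached buffer for (T, dims), allocating only on first use.
function get_scratchspace(::Type{T}, dims::Dims{N}) where {T,N}
    lock(SCRATCHSPACE_LOCK) do
        get!(SCRATCHSPACE_MEMO, (T, dims)) do
            Array{T}(undef, dims)
        end::Array{T,N}
    end
end

# Free the cached buffers by emptying the memo.
function free_scratchspace_memo!()
    lock(SCRATCHSPACE_LOCK) do
        empty!(SCRATCHSPACE_MEMO)
    end
    return nothing
end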

Before (using ObjectDetector.jl as a benchmark):

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.908225 seconds (52.34 k allocations: 51.040 GiB, 18.35% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.986337 seconds (52.34 k allocations: 51.040 GiB, 17.18% gc time)

After:

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.943823 seconds (53.09 k allocations: 14.850 GiB, 9.25% gc time) # the first call with an empty memo will allocate to fill the memo

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.800984 seconds (53.00 k allocations: 1.101 GiB, 3.18% gc time)

julia> @time yolomod(batch, detectThresh=0.5, overlapThresh=0.8);
  0.732375 seconds (53.00 k allocations: 1.101 GiB, 0.32% gc time)

We saved 50 GiB of allocations!

An alternative design is to plumb the col parameter through the many intermediate generated functions, which seems quite complex; a sketch of what that would entail follows.
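For contrast, a toy sketch of the plumbing alternative (stand-in functions, not NNlib's real call chain): the buffer must be accepted and forwarded at every level between the public entry point and the kernel.

conv_entry(x; col) = conv_middle(x; col)    # public entry point
conv_middle(x; col) = conv_kernel(x; col)   # one of many intermediates
conv_kernel(x; col) = (col .= x; sum(col))  # stand-in for the im2col kernel

col = zeros(Float32, 4)
conv_entry(ones(Float32, 4); col)  # the caller owns and reuses the buffer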

PR Checklist

  • Tests are added
  • Documentation, if applicable

@Octogonapus (Author)

The Lux.jl test failures look unrelated to this PR. The Flux.jl test failures also look unrelated, but they are confusing to me, as I am not very familiar with that project.

@Octogonapus (Author)

Eric Hanson noted that this might be problematic for GPU workloads, which often run out of memory. That's a good point, so maybe this could be configured via scoped values or a global configuration mechanism. I would be interested in what the maintainers think before making changes.
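A hypothetical sketch of the scoped-value idea, assuming Julia 1.11's Base.ScopedValues and the get_scratchspace helper sketched above (the flag name is made up):

using Base.ScopedValues

const USE_SCRATCHSPACE_MEMO = ScopedValue(true)

# Fall back to a fresh allocation when the memo is disabled.
function maybe_memoized_scratchspace(::Type{T}, dims::Dims{N}) where {T,N}
    USE_SCRATCHSPACE_MEMO[] ? get_scratchspace(T, dims) : Array{T}(undef, dims)
end

# GPU-heavy code could opt out locally:
# with(USE_SCRATCHSPACE_MEMO => false) do
#     conv(x, w)
# end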

@ToucheSir (Member)

Rephrasing my thoughts from the Slack thread,

> That's a good point, so maybe this could be configured via scoped values

this sounds like a cool idea. Perhaps we could let users plug in their own allocation function or allocator that follows a defined interface?
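One hypothetical shape for such an interface (purely illustrative, reusing the get_scratchspace sketch above):

abstract type ScratchAllocator end

struct DefaultAllocator <: ScratchAllocator end   # always allocates fresh
alloc_scratch(::DefaultAllocator, ::Type{T}, dims) where {T} = Array{T}(undef, dims)

struct MemoAllocator <: ScratchAllocator end      # reuses cached buffers
alloc_scratch(::MemoAllocator, ::Type{T}, dims) where {T} = get_scratchspace(T, dims)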

@Octogonapus (Author)

Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into.

@pxl-th (Member) commented Mar 15, 2025

> Someone also brought up https://juliagpu.github.io/GPUArrays.jl/dev/interface/#Caching-Allocator in that thread, which I will look into.

To implement it for an array type, you basically have to use the GPUArrays.cached_alloc function, where the first argument is the function that does the actual allocation (if no cached buffer exists) and the second is the tuple/key that will be used to look up the cache for that allocation.
You can take a look at how AMDGPU.jl does it:
https://github.com/JuliaGPU/AMDGPU.jl/blob/822da911e34aa319808bf93d61178c45e759df75/src/array.jl#L9
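A rough sketch of that usage, assuming the signature described above (the key contents are illustrative, and the allocation is a CPU stand-in):

using GPUArrays

function cached_col(::Type{T}, dims) where {T}
    # The do-block runs only when no cached buffer exists for this key.
    GPUArrays.cached_alloc((:conv_im2col_col, T, dims)) do
        Array{T}(undef, dims)  # a GPU backend would allocate its array type here
    end
end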
