
Add checkbounds for gather and support empty source array #51

Open
wants to merge 5 commits into master

Conversation

yuehhua
Member

@yuehhua yuehhua commented Jun 3, 2022

@yuehhua yuehhua requested a review from mcabbott June 3, 2022 15:15
@yuehhua
Member Author

yuehhua commented Jun 3, 2022

@mcabbott would you please review this?

@yuehhua yuehhua changed the title Add checkbounds for gather Add checkbounds for gather and support empty source array Jun 3, 2022
@yuehhua yuehhua closed this Jun 4, 2022
@yuehhua yuehhua reopened this Jun 4, 2022
Co-authored-by: Peter <[email protected]>

Update src/gather.jl

Co-authored-by: Peter <[email protected]>

Update src/gather.jl

Co-authored-by: Peter <[email protected]>

Update src/gather.jl

Co-authored-by: Peter <[email protected]>
Member

@chengchingwen chengchingwen left a comment


I don't think I have the permission, so you would need someone else to review

@yuehhua yuehhua requested a review from CarloLucibello June 6, 2022 15:27
@yuehhua
Member Author

yuehhua commented Jun 6, 2022

@CarloLucibello would you please review this?

@CarloLucibello
Member

We can make the bounds checking much faster. In the simplest case with integer indexes:

a = axes(src, ndims(src))
checkindex(Bool, a, collect(extrema(idx)))

@yuehhua
Member Author

yuehhua commented Jun 7, 2022

I just tried it. It seems that extrema causes a scalar indexing issue for CUDA, but it works with maximum and minimum called separately.

@chengchingwen
Member

I think there won't be much difference in speed since they are all calling the mapreduce kernel

@CarloLucibello
Member

We can check just the maximum (we don't really support things like offset arrays in any case).
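For reference, a minimal sketch of that maximum-only check (assuming 1-based indexing, as noted; illustration only, not this PR's code):

a = axes(src, ndims(src))
in_bnd = checkindex(Bool, a, maximum(idx))  # one reduction kernel; only the upper bound is checked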

@yuehhua
Member Author

yuehhua commented Jun 7, 2022

I think @chengchingwen is right. They all go through the same GPU kernel, so reducing over an array costs about the same time as computing a single value. Since extrema needs both maximum and minimum, it calls the GPU kernel twice, once for each.

using CUDA
using BenchmarkTools

T = Float32
CT = CuArray{Float32}

src = CT([3, 4, 5, 6, 7])
idx = cu([1 2 3 4;
            4 2 1 3;
            3 5 5 3])

function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:Any})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i...)
end

function checkbounds_src(src, dims::Union{Int, Val}, ::Type{<:CartesianIndex})
    return i -> checkbounds(Bool, src, ntuple(x -> Colon(), dims)..., i)
end

function checkbounds1(src, idx, dims)
    return map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
end

function checkbounds2(src, idx, dims)
    a = axes(src, ndims(src))
    return checkindex(Bool, a, minimum(idx):maximum(idx))
end

checkbounds1(src, idx, 1)
checkbounds2(src, idx, 1)

julia> @benchmark CUDA.@sync checkbounds1($src, $idx, 1)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.406 μs … 73.799 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.080 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.074 μs ±  3.770 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▆▅▆▅▄▃▂▁                                                  ▂
  █████████████▇▇▇▇▆▇▆▅▄▄▄▄▄▃▄▄▂▃▅▄▄▂▄▄▆▅▄▅▅▄▅▅▄▄▄▅▂▄▄▅▅▄▅▄▃▃ █
  10.4 μs      Histogram: log(frequency) by time      32.7 μs <

 Memory estimate: 3.12 KiB, allocs estimate: 56.

julia> @benchmark CUDA.@sync checkbounds2($src, $idx, 1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.644 μs … 39.208 ms  ┊ GC (min … max): 0.00% … 44.45%
 Time  (median):     33.708 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.350 μs ± 391.981 μs  ┊ GC (mean ± σ):  4.32% ±  0.44%

  ▅██▇▆▄▂▁          ▁▁▁▁                                       ▂
  █████████████▇█▇▇▇█████▇▇▇▇▇▇▆▅▆▅▆▅▇▆▆▆▆▇▆▆▆▆▅▅▅▄▄▅▅▄▄▅▄▅▅▅▄ █
  31.6 μs       Histogram: log(frequency) by time      79.8 μs <

 Memory estimate: 4.88 KiB, allocs estimate: 88.

The original approach (checkbounds1) is faster than the proposed approach (checkbounds2).

@YichengDWu

Any updates?

@yuehhua
Member Author

yuehhua commented Jun 14, 2022

Waiting for review.

@CarloLucibello
Member

@yuehhua can you benchmark this PR vs master for a few input sizes? Just to make sure that bounds checking doesn't take longer than the real computation.

@yuehhua
Member Author

yuehhua commented Jun 15, 2022

Benchmark code:

using NNlib, NNlibCUDA
using CUDA
using BenchmarkTools

T = Float32
CT = CuArray{Float32}

src = CT([3, 4, 5, 6, 7]);
idx = cu([1 2 3 4;
          4 2 1 3;
          3 5 5 3]);

NNlib.gather(src, idx);

This PR:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.514 μs …  29.643 ms  ┊ GC (min … max): 0.00% … 35.08%
 Time  (median):     33.589 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.676 μs ± 296.188 μs  ┊ GC (mean ± σ):  2.76% ±  0.35%

         ▂▇█▇▅▃▁                                                
  ▂▄▆▆▆▇▇███████▇▇▆▆▆▅▄▄▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  28.5 μs         Histogram: frequency by time         56.4 μs <

 Memory estimate: 6.05 KiB, allocs estimate: 109.

master branch:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.472 μs … 55.821 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.803 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.026 μs ±  1.099 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █                                                          
  ▅██▆██▄▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  6.47 μs        Histogram: frequency by time        11.5 μs <

 Memory estimate: 544 bytes, allocs estimate: 12.

It is around 5 times slower than the master branch.

@CarloLucibello
Member

Mhmh. That is a very small size though; can you check with 10x or 100x bigger arrays?

@yuehhua
Member Author

yuehhua commented Jun 15, 2022

PR:

julia> src = CUDA.rand(50);

julia> idx = rand(1:50, 30, 4) |> cu;

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.024 μs …  34.845 ms  ┊ GC (min … max): 0.00% … 37.69%
 Time  (median):     34.888 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.244 μs ± 348.197 μs  ┊ GC (mean ± σ):  3.26% ±  0.38%

    ▃▆█▆▇▅▃▁                                                    
  ▁▆████████▇▆▅▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  30 μs           Histogram: frequency by time         72.4 μs <

 Memory estimate: 6.05 KiB, allocs estimate: 109.

master branch:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.811 μs … 104.485 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.106 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.516 μs ±   1.748 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▅▆▆▄▄▃▂▁                                                  ▂
  ████████████▇██▇▇▇█████▇▇▇▅▅▅▅▆▅▄▅▄▄▄▄▄▅▅▄▄▅▅▆▅▅▆▅▅▄▆▄▄▄▄▄▂ █
  6.81 μs      Histogram: log(frequency) by time      14.7 μs <

 Memory estimate: 544 bytes, allocs estimate: 12.

@yuehhua
Member Author

yuehhua commented Jun 15, 2022

It seems to spend too much time in GC. Maybe the closure is causing this.

@YichengDWu

Is map using CUDA kernels?

src/gather.jl Outdated

# check bounds
in_bnd = map(checkbounds_src(src, Val(dims), eltype(idx)), idx)
if !all(in_bnd)

@YichengDWu YichengDWu Jun 15, 2022


This line accounts for a large portion of the slowdown.

@YichengDWu

I have done some tests; the closure here is fine. Using Base.Fix2 won't help.

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

The closure issue is resolved; benchmark results below:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.388 μs … 797.144 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.675 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.996 μs ±  10.354 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂▅█▄                                            
  ▁▂▂▁▁▁▁▆▇██▇██████▇▄▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  22.4 μs         Histogram: frequency by time         55.6 μs <

 Memory estimate: 3.08 KiB, allocs estimate: 53.

@MilkshakeForReal map does use a CUDA kernel. However, the map CUDA kernel itself costs a large portion of the time. I turned off the map CUDA kernel and got:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.588 μs … 95.860 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.985 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.414 μs ±  1.796 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▇▅▅▄▃▂▂▂▂▂▁▁                                            ▂
  ██████████████████▇▇▇▇▆▆▇▆▇▆▇▇▇▇▇▇▆▆▆▆▆▇▆▆▄▅▆▆▅▅▅▆▅▃▅▃▅▄▆▅ █
  6.59 μs      Histogram: log(frequency) by time     14.6 μs <

 Memory estimate: 544 bytes, allocs estimate: 12.

It is still 4 times slower. So we could come up with a more efficient CUDA kernel, find another efficient way to check bounds, or even skip bounds checking.

@YichengDWu

For the last version, commenting out if !all(in_bnd) reduces the time from 68 μs to 25 μs on my machine. And it's 15 μs without bounds checking. So I concluded that map was fine but reduce was slower.

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

Dropped @inbounds in the CUDA kernel to use CUDA's own bounds check.

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.654 μs … 111.630 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.214 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.549 μs ±   1.843 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▇█▅▆▅▄▂▂▂▂▂▂▁▂▁▁                                         ▂
  ███████████████████████▇▇▇▆▆▆▆▇▆▅▅▆▅▆▄▅▅▅▄▅▄▅▆▄▄▃▅▅▅▅▃▅▅▅▄▄ █
  6.65 μs      Histogram: log(frequency) by time      14.5 μs <

 Memory estimate: 528 bytes, allocs estimate: 12.
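Roughly, the idea is that without @inbounds an out-of-range index traps inside the kernel itself. A simplified, hypothetical kernel body (not the actual NNlibCUDA gather_kernel!):

function naive_gather_kernel!(dst, src, idx)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(idx)
        dst[i] = src[idx[i]]  # no @inbounds: CUDA.jl's own bounds check fires on a bad index
    end
    return nothing
end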

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

@MilkshakeForReal For the last version, all(in_bnd) is merged into mapreduce, which reduces the time from 37.676 μs to 30.996 μs. Just a small improvement.
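In other words, writing inbounds_f as shorthand for the predicate built by checkbounds_src(src, Val(dims), eltype(idx)), the two variants are roughly:

in_bnd = all(map(inbounds_f, idx))      # map kernel + reduce kernel, plus a temporary Bool array
in_bnd = mapreduce(inbounds_f, &, idx)  # one fused kernel, no temporary array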

@YichengDWu

MRE:

function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)

    # check bounds
    idx_bounds = size(src, ndims(src))#[dims+1:end]
    in_bnd = map(i -> i <= idx_bounds, idx)

    isempty(src) && return dst

    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end
julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.900 μs … 2.720 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.564 μs ± 71.784 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆██▇▆▅▄▃▃▃▂▂▁▁▁▁▁▁▂▁▁▁                                     ▂
  ███████████████████████████▇▇██▇▆▇▇▇▇▆▆▇▇▆▆▆▆▅▅▆▄▄▄▄▅▃▅▂▃▄▃ █
  18.9 μs      Histogram: log(frequency) by time      60.7 μs <

 Memory estimate: 1.33 KiB, allocs estimate: 31.

With !all(in_bnd):

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  55.200 μs … 2.510 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     69.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   79.856 μs ± 102.221 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂█▇▃▂▁
  ▁▃██████▆▅▆▄▅▇▇▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  55.2 μs         Histogram: frequency by time          174 μs <

 Memory estimate: 3.72 KiB, allocs estimate: 72.

Without any bounds checking:

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.600 μs …  2.407 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.899 μs ± 48.924 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅▆██▆▄▃▄▃▂▁ ▁▁     ▁▂▃▃▃▄▃▂▂▂▂▂▂▂                          ▂
  █████████████████▇█▇██████████████▇█▇▆▆▅▆▅▇▇▅▆▆▆▅▅▅▅▆██▆▆▇▅ █
  12.6 μs      Histogram: log(frequency) by time      51.5 μs <

 Memory estimate: 544 bytes, allocs estimate: 12.

@YichengDWu

YichengDWu commented Jun 16, 2022

@MilkshakeForReal For the last version, all(in_bnd) is merged into mapreduce, which reduces the time from 37.676 μs to 30.996 μs. Just a small improvement.

I believe you can achieve almost identical speedup by writing

in_bnd = mapreduce(checkbounds_src(src, Val(dims), eltype(idx)), &, idx)

in the last version. The speedup does not come mainly from removing the closure.

@YichengDWu

YichengDWu commented Jun 16, 2022

This is my implementation based on yours. It seems to work fine. I still don't know why we need to dispatch on dims.

function _checkbounds_indices(i::Tuple, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, i) 
end

function _checkbounds_indices(i::CartesianIndex, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, Tuple(i))
end

function _checkbounds_indices(i::Int, idx_bounds::Tuple)
    return Base.checkbounds_indices(Bool, idx_bounds, (i,))
end

function NNlib.gather!(dst::AnyCuArray, src::AnyCuArray, idx::AnyCuArray)
    # check dims
    dims = gather_check_dims(src, dst, idx)
    dims_size = size(src)[1:dims]
    max_dims_idx = prod(dims_size)
    max_idx = max_dims_idx * length(idx)

    # check bounds
    idx_bounds = axes(src)[dims+1:end]
    in_bnd = mapreduce(Base.Fix2(_checkbounds_indices,idx_bounds), &, idx)

    if !in_bnd
        #whatever is here, we don't need to care about the speed when something is wrong.
    end

    isempty(src) && return dst

    # cuda kernel
    args = dst, src, idx, max_idx, max_dims_idx, dims_size
    kernel = @cuda launch=false gather_kernel!(args...)
    config = launch_configuration(kernel.fun; max_threads=256)
    threads = min(max_idx, config.threads)
    blocks = cld(max_idx, threads)
    kernel(args...; threads=threads, blocks=blocks)
    return dst
end

@YichengDWu

YichengDWu commented Jun 16, 2022

@MilkshakeForReal For the last version, all(in_bnd) is merged into mapreduce, which reduces the time from 37.676 μs to 30.996 μs. Just a small improvement.

The improvement from removing all(in_bnd) is of the same magnitude as the improvement of the latest version over the previous one.

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

The improvement from removing all(in_bnd) is of the same magnitude as the improvement of the latest version over the previous one.

# check bounds
idx_bounds = size(src, ndims(src))#[dims+1:end]
in_bnd = map(i -> i <= idx_bounds, idx)

We still have to raise a bounds-check error to users; otherwise it is meaningless to do the bounds check at all. There may be other ways to replace all(in_bnd), but you still have to reduce the in_bnd result from map to a single value.

@YichengDWu

It's just to demonstrate that we don't really need to avoid the closure in this case. I just tested the latest version and the performance was almost identical to mine, with or without the closure; both use mapreduce. Throwing the error message can be done in a slower fashion, since if something is wrong we just want it to stop and don't care much about the speed. We can even use map again inside if !in_bnd.
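A rough sketch of that slow error path, reusing _checkbounds_indices and idx_bounds from the snippet above (illustration only, not this PR's code):

in_bnd = mapreduce(Base.Fix2(_checkbounds_indices, idx_bounds), &, idx)
if !in_bnd
    # only reached when something is wrong, so speed no longer matters here
    flags = Array(map(Base.Fix2(_checkbounds_indices, idx_bounds), idx))
    error("gather!: index out of bounds at position ", findfirst(!, flags))
end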

@chengchingwen
Member

@MilkshakeForReal Avoiding the closure is for reducing GC time and memory allocation.

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

Not much improvement for the latest proposal.

julia> @benchmark CUDA.@sync NNlib.gather($src, $idx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.986 μs … 618.544 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     32.902 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.122 μs ±   7.792 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁    ▁▂▄██▃                                          
  ▂▃▂▂▁▁▃▇▇██▇▇███████▆▅▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  25 μs           Histogram: frequency by time         54.6 μs <

 Memory estimate: 3.81 KiB, allocs estimate: 73.

@YichengDWu

Do you mean my proposal? There isn't any improvement. Just similar performance.

@yuehhua
Member Author

yuehhua commented Jun 16, 2022

@MilkshakeForReal We still need a reduce to summarize a CuArray{Bool} into a single Boolean value. ! cannot be applied to a CuArray{Bool}, and the if statement cannot work on an array.

julia> in_bnd = rand(Bool, 5) |> cu;

julia> if !in_bnd
           println(123)
       end
ERROR: MethodError: no method matching !(::CuArray{Bool, 1, CUDA.Mem.DeviceBuffer})
Closest candidates are:
  !(::Function) at /opt/julia-1.7.3/share/julia/base/operators.jl:1117
  !(::Bool) at /opt/julia-1.7.3/share/julia/base/bool.jl:35
  !(::Missing) at /opt/julia-1.7.3/share/julia/base/missing.jl:101
Stacktrace:
 [1] top-level scope
   @ REPL[12]:1
 [2] top-level scope
   @ ~/.julia/packages/CUDA/GGwVa/src/initialization.jl:52
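For completeness, reducing first does make the branch work, since all on a CuArray{Bool} runs a GPU reduction and returns a plain Bool:

if !all(in_bnd)
    println(123)
end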

@YichengDWu

YichengDWu commented Jun 16, 2022

Where do we have this issue? In my code it's already reduced. The closure is also avoided if we don't want it.

@YichengDWu

memory allocation

I don't know much about the GC time; I'm just speaking from my test results.

@YichengDWu

Could we isolate the support for empty arrays and leave bounds checking to further discussion?

Development

Successfully merging this pull request may close these issues.

add bounds checking in gather