Conversation

@jishnub (Member) commented Oct 27, 2025

map is a simpler operation than broadcasting and uses linear indexing for Arrays. This often improves performance (occasionally enabling vectorization) and improves TTFX (time to first execution) in common cases. It also automatically returns the correct result for 0-D arrays, unlike broadcasting, which collapses them to a scalar.
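The 0-D behavior can be seen by calling `map` and `broadcast` directly (an illustrative REPL sketch, not part of the PR's diff):

```julia
julia> a = fill(1.0)        # a 0-dimensional array
0-dimensional Array{Float64, 0}:
1.0

julia> map(+, a, a)         # map preserves the 0-D array
0-dimensional Array{Float64, 0}:
2.0

julia> broadcast(+, a, a)   # broadcast collapses the result to a scalar
2.0
```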

Performance:

julia> A = ones(3,3);

julia> @btime $A + $A;
  44.622 ns (2 allocations: 144 bytes) # v"1.13.0-DEV.1387"
  29.047 ns (2 allocations: 144 bytes) # this PR

julia> A = ones(3,3000);

julia> @btime $A + $A;
  10.095 μs (3 allocations: 70.40 KiB) # v"1.13.0-DEV.1387"
  4.787 μs (3 allocations: 70.40 KiB) # this PR

julia> @btime A + B + C + D + E + F setup=(A = rand(200,200); B = rand(200,200); C = rand(200,200); D = rand(200,200); E = rand(200,200); F = rand(200,200));
  93.910 μs (3 allocations: 312.59 KiB) # v"1.13.0-DEV.1387"
  64.813 μs (9 allocations: 312.77 KiB) # this PR

Similarly for -.
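Schematically, the change amounts to routing elementwise `+` and `-` through `map` instead of the broadcast machinery. The sketch below uses hypothetical function names (`plus`, `minus`) to avoid redefining Base methods; the PR's actual diff and signatures may differ:

```julia
# Sketch only: dispatch elementwise arithmetic on Arrays to map.
plus(A::Array, B::Array)  = map(+, A, B)  # previously routed through broadcast
minus(A::Array, B::Array) = map(-, A, B)  # "Similarly for -"
```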

TTFX:

julia> A = ones(3,3);

julia> @time A + A;
  0.174090 seconds (303.47 k allocations: 14.575 MiB, 99.98% compilation time) # v"1.13.0-DEV.1387"
  0.072748 seconds (220.27 k allocations: 11.139 MiB, 99.95% compilation time) # this PR

These are measured on

julia> versioninfo()
Julia Version 1.13.0-DEV.1388
Commit c5f492781e (2025-10-27 11:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-10310U CPU @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.8 (ORCJIT, skylake)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/lib:
  JULIA_EDITOR = subl

@jishnub jishnub added the performance (Must go faster), arrays, and latency labels Oct 27, 2025
@Seelengrab (Contributor) commented:

> Performance: [...]
> Similarly for -.

Are these representative? The arguments are literally the same array, after all, so it's quite possible that map hits a fast path for aliased inputs that the more complicated broadcast machinery doesn't.

@jishnub (Member, Author) commented Oct 27, 2025

That's a good point! I've re-run the benchmarks, and some of these do hold up in more general cases:

julia> @btime A + B setup=(A = rand(3,3); B = rand(3,3));
  39.452 ns (2 allocations: 144 bytes) # v"1.13.0-DEV.1387"
  27.789 ns (2 allocations: 144 bytes) # this PR

julia> @btime A + B setup=(A = rand(3,3000); B = rand(3,3000));
  10.130 μs (3 allocations: 70.40 KiB) # v"1.13.0-DEV.1387"
  5.026 μs (3 allocations: 70.40 KiB)  # this PR

The difference in the 300x300 case seems spurious, so I've removed it from the OP. If anything, that case worsens slightly under this PR, and I'm not sure why; performance is nearly identical for 200x200 and 500x500 matrices, so it's probably insignificant.

The main benefit comes in the wide matrix case, where the first dimension is too small for vectorization to kick in. Using linear indexing offers a significant speed-up. This was suggested in #47873 (comment).
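The effect of linear indexing can be sketched with two hypothetical helpers (not the PR's code): a nested cartesian loop over a 3×N matrix gives the inner loop a trip count of only 3, too short for SIMD to pay off, while iterating `eachindex` walks the same memory as one flat range:

```julia
# Hypothetical comparison, not the PR's actual implementation.
function add_cartesian(A::Matrix, B::Matrix)
    C = similar(A)
    for j in axes(A, 2), i in axes(A, 1)  # inner loop length = size(A, 1)
        @inbounds C[i, j] = A[i, j] + B[i, j]
    end
    return C
end

function add_linear(A::Matrix, B::Matrix)
    C = similar(A)
    for i in eachindex(A, B)  # one flat loop over length(A) elements
        @inbounds C[i] = A[i] + B[i]
    end
    return C
end
```

For `A = ones(3, 3000)` the flat loop is the shape that `map` over Arrays lowers to, which is where the wide-matrix speed-up above comes from.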

@jishnub jishnub requested a review from oscardssmith November 3, 2025 20:18
@jishnub jishnub merged commit b05afe0 into master Nov 11, 2025
7 checks passed
@jishnub jishnub deleted the jishnub/add_map branch November 11, 2025 06:12