cheb2leg is slow #97

dlfivefifty · 2019-12-27T17:35:21Z

This seems very slow:

julia> @time cheb2leg(randn(1_000));
  0.014712 seconds (9 allocations: 16.438 KiB)

julia> @time cheb2leg(randn(10_000));
  0.491001 seconds (11 allocations: 156.969 KiB)

julia> @time cheb2leg(randn(100_000));
  8.688441 seconds (11 allocations: 1.527 MiB)

dlfivefifty · 2019-12-27T17:37:24Z

I guess its all in the plan:

julia> @time p = plan_cheb2leg(randn(100_000));
  8.466608 seconds (7 allocations: 781.531 KiB)

julia> @time p * randn(100_000);
  0.160250 seconds (172 allocations: 1.536 MiB)

I remember the Toeplitz dot Hankel being much faster.

MikaelSlevinsky · 2019-12-29T18:04:34Z

Indeed, the plans are slower than the Toeplitz-dot-Hankel approach but the execution times are much faster. Comparing with Figure 5 from the Webb, Townsend, Olver paper, I see:

n	ToH	plan_cheb2leg	p * c
1,000	0.05	0.013	0.00043
10,000	0.2	0.37	0.0067
100,000	1.3	8.4	0.091

I think that for multiple transforms, this is a better alternative.

There is one (possibly surprising) reason the plans are slower than before: in C, the Float64 plans are first created in long double precision before being converted to double precision data structures to squeeze out the trailing bits more accurately. Similarly, in Float32, they are first created in double precision.

If one would accept 12-14 digits instead of 14-16, then the purely 64-bit plans would be fine for 64-bit execution. The plan time at 100_000 is then just a bit less than the current Float32 plan: @time p = plan_cheb2leg(randn(Float32, 100_000)); 2.153270 seconds (5 allocations: 208 bytes).

Notice that now, the 1D plans are also multithreaded, so:

julia> c = randn(Float64, 100_000);

julia> @time p = plan_cheb2leg(c);
  8.197012 seconds (5 allocations: 208 bytes)

julia> @time p*c;
  0.092681 seconds (8 allocations: 781.844 KiB)

julia> x = randn(Float64, 100_000, 20);

julia> @time p*x;
  0.514451 seconds (8 allocations: 15.259 MiB)

With four logical cores, the twenty-column execution is not twenty times slower.

dlfivefifty · 2019-12-29T18:32:12Z

Ok so we should use Toeplitz dot Hankel for cheb2leg as the plan is only used once, but the current plan for plan_cheb2leg with the assumption that if you are planning you’ll be using many times.

MikaelSlevinsky · 2019-12-29T19:03:19Z

Better would be to implement the Alpert--Rokhlin scheme in C for cases where the connection coefficients and bounds on off-diagonal block ranks are all known. See #75 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cheb2leg is slow #97

cheb2leg is slow #97

dlfivefifty commented Dec 27, 2019

dlfivefifty commented Dec 27, 2019

MikaelSlevinsky commented Dec 29, 2019

dlfivefifty commented Dec 29, 2019

MikaelSlevinsky commented Dec 29, 2019

cheb2leg is slow #97

cheb2leg is slow #97

Comments

dlfivefifty commented Dec 27, 2019

dlfivefifty commented Dec 27, 2019

MikaelSlevinsky commented Dec 29, 2019

dlfivefifty commented Dec 29, 2019

MikaelSlevinsky commented Dec 29, 2019