Faster _fmpz_vec_scalar_divexact_ui #2371

fredrik-johansson · 2025-07-19T09:03:14Z

Speed up _fmpz_vec_scalar_divexact_si, _fmpz_vec_scalar_divexact_ui and small-divisor _fmpz_vec_scalar_divexact_fmpz by precomputing an inverse mod 2^FLINT_BITS. Also special case powers of two.
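For readers unfamiliar with the trick, here is a minimal sketch of exact division by a single-word divisor using an inverse mod 2^64, with the power-of-two part handled by a shift. It assumes 64-bit limbs and plain `uint64_t` entries rather than `fmpz`; the names `binvert64` and `vec_divexact_u64` are made up for illustration and are not the functions in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inverse of an odd d modulo 2^64 by Newton iteration.
   Hypothetical helper for illustration only. */
static uint64_t binvert64(uint64_t d)
{
    uint64_t inv = d;           /* d*d = 1 mod 8 for odd d: 3 bits correct */
    for (int i = 0; i < 5; i++)
        inv *= 2 - d * inv;     /* each step doubles the number of correct bits */
    return inv;
}

/* res[i] = x[i] / d, assuming d != 0 and d divides every x[i] exactly. */
static void vec_divexact_u64(uint64_t *res, const uint64_t *x,
                             size_t len, uint64_t d)
{
    size_t i;
    unsigned k = 0;
    uint64_t inv;

    assert(d != 0);

    /* Split d = 2^k * (odd part). */
    while ((d & 1) == 0)
    {
        d >>= 1;
        k++;
    }

    if (d == 1)
    {
        /* Pure power of two: a shift is enough. */
        for (i = 0; i < len; i++)
            res[i] = x[i] >> k;
        return;
    }

    /* The inverse is computed once and reused for the whole vector. */
    inv = binvert64(d);
    for (i = 0; i < len; i++)
        res[i] = (x[i] >> k) * inv;   /* exact quotient mod 2^64 */
}
```

The multiplication gives the true quotient only because the division is known to be exact; for a non-multiple of d the result is meaningless.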

This probably will not make a measurable difference on any macrobenchmarks, because the standard use case is to divide out a GCD which was more expensive to compute in the first place. However, I thought I'd put this code in as a template for doing divexact with precomputed inverses; other functions like fmpz_mat_fflu and _fmpz_poly_interpolate_exact_newton could use the same trick in the future.
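As a rough illustration of what such a template could look like for a multi-limb dividend and a single-limb divisor, here is a sketch of the standard low-to-high exact division loop with a preinverted odd divisor. This is not taken from the PR: it assumes 64-bit limbs stored least significant first, relies on the `unsigned __int128` extension of GCC and Clang for the high product, and uses the hypothetical names `binvert64` and `limbs_divexact_odd`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inverse of an odd d modulo 2^64 by Newton iteration (illustrative helper). */
static uint64_t binvert64(uint64_t d)
{
    uint64_t inv = d;           /* correct to 3 bits for odd d */
    for (int i = 0; i < 5; i++)
        inv *= 2 - d * inv;     /* doubles the correct bits each step */
    return inv;
}

/* q = x / d for an n-limb x (least significant limb first), assuming
   d is odd, dinv = d^(-1) mod 2^64, and d divides x exactly.
   q may alias x. */
static void limbs_divexact_odd(uint64_t *q, const uint64_t *x, size_t n,
                               uint64_t d, uint64_t dinv)
{
    uint64_t carry = 0;

    assert(d & 1);

    for (size_t i = 0; i < n; i++)
    {
        uint64_t s = x[i];
        uint64_t l = s - carry;          /* subtract what earlier limbs used up */
        uint64_t borrow = (s < carry);   /* 1 if that subtraction wrapped */
        uint64_t ql = l * dinv;          /* next quotient limb, exact mod 2^64 */

        q[i] = ql;

        /* High word of ql * d must be removed from the next limb. */
        carry = (uint64_t) (((unsigned __int128) ql * d) >> 64) + borrow;
    }
}
```

The inverse is passed in rather than computed inside the function so that one call to `binvert64` can be shared across all entries of a vector, which is the point of precomputing it.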

Multi-limb preinverted divexact should also be done some day but it's a lot more work.

Profile code output:

$ build/fmpz_vec/profile/p-divexact_ui 
   len    bits(A*c)  bits(c)      old       new    speedup
     1         2        2    2.77e-09   3.67e-09   0.755
     1         0       46    2.81e-09   3.46e-09   0.812
     1        66       51    2.61e-08   2.66e-08   0.981
     1         0       33    2.83e-09   3.47e-09   0.816
     1        79       44    2.34e-08   2.41e-08   0.971
     1       112       36       1e-08   1.06e-08   0.943
     1         0       44    2.83e-09   3.48e-09   0.813
     1      -518       34    1.88e-08   1.88e-08   1.000
     1       375       43    1.36e-08   1.36e-08   1.000
     1       929       17    2.64e-08   2.66e-08   0.992
     1      -114       54    2.37e-08   2.43e-08   0.975
     2         0       45    5.71e-09   7.27e-09   0.785
     2       -64       55    2.51e-08   7.06e-09   3.555
     2       -46       30     5.7e-09   7.24e-09   0.787
     2       -59       22    5.85e-09   6.64e-09   0.881
     2        45       17    5.99e-09   7.32e-09   0.818
     2      -170       63    2.06e-08   1.67e-08   1.234
     2         0       31    5.78e-09    6.6e-09   0.876
     2       564       18    2.08e-08    2.1e-08   0.990
     2       817       22    4.74e-08   4.54e-08   1.044
     2     -2270       43    1.21e-07   1.24e-07   0.976
     2     -4587       17    2.38e-07   2.42e-07   0.983
     3        58       54    8.15e-09   8.68e-09   0.939
     3       -58       56    8.18e-09   7.46e-09   1.097
     3       -28       13    8.15e-09   8.66e-09   0.941
     3       -39        7    8.12e-09   7.44e-09   1.091
     3      -131       55    5.78e-08   1.64e-08   3.524
     3      -133       21    2.25e-08   1.47e-08   1.531
     3      -262       23    3.42e-08   2.85e-08   1.200
     3      -497       34    4.78e-08   3.71e-08   1.288
     3     -1095       41    7.62e-08   6.96e-08   1.095
     3      2392       52    1.63e-07   1.62e-07   1.006
     3     -2164       30    6.37e-08   6.78e-08   0.940
     4       -31       27    1.08e-08   9.72e-09   1.111
     4       -27       20    1.07e-08   8.64e-09   1.238
     4       -64       64    6.98e-08   1.03e-08   6.777
     4       -91       53    9.38e-08    1.4e-08   6.700
     4       -81        4    2.51e-08   2.36e-08   1.064
     4      -165       64    3.24e-08    2.3e-08   1.409
     4      -365       56    5.74e-08   3.15e-08   1.822
     4      -488       47    4.56e-08   3.53e-08   1.292
     4     -1248       60    1.04e-07   7.02e-08   1.481
     4     -2282       30    2.04e-07   1.98e-07   1.030
     4      -234        7    1.86e-08   1.41e-08   1.319
     5        54       54    1.35e-08   9.52e-09   1.418
     5        32       24    1.36e-08   1.14e-08   1.193
     5        53       53    1.35e-08   1.14e-08   1.184
     5         0       29    1.35e-08   1.14e-08   1.184
     5      -135       64    8.81e-08   2.24e-08   3.933
     5         0       60    1.39e-08   9.54e-09   1.457
     5      -167       13    2.06e-08   1.38e-08   1.493
     5      -685       51    8.26e-08   6.32e-08   1.307
     5     -1235       27    9.26e-08   7.55e-08   1.226
     5     -2277        5    1.95e-07   1.89e-07   1.032
     5     -4833       22     3.6e-07   3.58e-07   1.006
     6        -6        2    1.61e-08   7.21e-09   2.233
     6       -32       22    1.61e-08   1.27e-08   1.268
     6       -82       62    9.71e-08   1.52e-08   6.388
     6        46       30    1.61e-08   1.07e-08   1.505
     6        75       14    3.64e-08   1.17e-08   3.111
     6      -189       37    4.66e-08   2.98e-08   1.564
     6       342       37    2.59e-08   2.12e-08   1.222
     6         0       41    1.61e-08   1.07e-08   1.505
     6     -1172        8    1.57e-07   1.36e-07   1.154
     6      2384       12    1.36e-07   1.37e-07   0.993
     6     -5083       53    6.66e-07   6.66e-07   1.000
     7       -33       28    1.87e-08   1.41e-08   1.326
     7         0       20    1.87e-08    1.4e-08   1.336
     7       -80       62    1.22e-07   1.75e-08   6.971
     7        69       56     3.9e-08   1.19e-08   3.277
     7      -105       32    5.39e-08   1.97e-08   2.736
     7         0       37    1.87e-08   1.18e-08   1.585
     7      -287       20     2.7e-08   1.86e-08   1.452
     7      -458       43     7.3e-08   5.17e-08   1.412
     7      -957       46    1.06e-07   8.26e-08   1.283
     7     -2357       58    2.86e-07   2.77e-07   1.032
     7     -4344       40    4.56e-07   4.45e-07   1.025
     8         0        9    2.14e-08   1.28e-08   1.672
     8        -4        4    2.14e-08   1.28e-08   1.672
     8       -78       59    1.24e-07    1.7e-08   7.294
     8       -74       37    8.35e-08   1.78e-08   4.691
     8      -116       36    1.13e-07    2.8e-08   4.036
     8      -191       63    3.68e-08   2.44e-08   1.508
     8      -271       36    3.85e-08   2.92e-08   1.318
     8      -398       23    9.22e-08    4.9e-08   1.882
     8     -1115        8    1.61e-07   1.25e-07   1.288
     8     -1371       46     1.1e-07   9.68e-08   1.136
     8     -4883       46     3.9e-07   3.69e-07   1.057
     9         0        4    2.41e-08   1.59e-08   1.516
     9       -42       33    2.41e-08    1.6e-08   1.506
     9       -45       29     2.4e-08   1.59e-08   1.509
     9        53       39     2.4e-08   1.35e-08   1.778
     9      -129       53    1.13e-07   3.16e-08   3.576
     9      -159        1    7.18e-08   5.42e-08   1.325
     9      -326       60     6.1e-08   4.38e-08   1.393
     9         0       62     2.4e-08   1.35e-08   1.778
     9     -1212       62    2.03e-07   1.62e-07   1.253
     9         0       28     2.4e-08   1.59e-08   1.509
     9     -3064       16    1.88e-07   1.75e-07   1.074
    10       -21       18    2.67e-08   1.44e-08   1.854
    10       -10        2    2.67e-08   1.43e-08   1.867
    10       -22        2    2.66e-08   1.44e-08   1.847
    10       -40       23    2.68e-08   1.43e-08   1.874
    10      -102       22    1.14e-07   2.03e-08   5.616
    10       178       64    3.45e-08   2.33e-08   1.481
    10      -299       36    6.36e-08   2.92e-08   2.178
    10       467       49    3.68e-08   2.59e-08   1.421
    10      -855       43    4.86e-08   3.99e-08   1.218
    10     -2122       63    1.22e-07   1.14e-07   1.070
    10     -4740       15    7.86e-07   1.45e-07   5.421

fredrik-johansson · 2025-07-19T09:09:49Z

Would be good to see how this compares on Intel and Apple hardware before merging.

src/fmpz_vec/scalar_divexact.c

fredrik-johansson · 2025-10-31T09:25:23Z

Merging this. Although it could potentially be slightly slower on some systems I haven't tested, it should be fine on recent x86-64 machines, and I don't want the PR to linger forever.

fredrik-johansson added 5 commits July 18, 2025 20:43

Faster _fmpz_vec_scalar_divexact for single-word divisors

93a3bbd

Try to fix _gr_vec_divexact test code

57fcdc5

fmpz_mpoly_scalar_divexact_fmpz

7d6daf2

Fix fmpz_mpoly_scalar_divexact_fmpz test

31ccb70

Profiling code; fmpz safety

b3990ee

fredrik-johansson mentioned this pull request Jul 22, 2025

Integer divexact with precomputed inverse #2375

Open

Slightly faster n_binvert

a11d019

albinahlback reviewed Jul 22, 2025

src/fmpz_vec/scalar_divexact.c (outdated, resolved)

Missing underscore

3197a77

fredrik-johansson merged commit 0da3327 into flintlib:main Oct 31, 2025
10 checks passed

fredrik-johansson deleted the divexact1 branch October 31, 2025 09:26

fredrik-johansson added performance workshop 2025v2 labels Oct 31, 2025
