Faster base conversion #580
| Yep, feared that ;-) | 
| Took me a bit, but the test for the largest possible input for MP_64BIT ran successfully. The input number was generated with Pari/GP. Largest possible number for 60-bit limbs would be  This result was not produced with the develop branch (that would have taken weeks) but with some additional fast algorithms: TC 2-5, FFT (FHT), Newton division (no NTT yet). TC only goes up to 5 because the lower cutoff of FFT was so close to TC-5 that I called it good enough for this test. Multi-threading is done with a handful of simple OpenMP instructions, so there is some room for improvement. The test ran on an older i7-2600 with 4 cores/8 threads. | 
| 
 
 Memory used was not measured exactly but was about 6.5 Gibibytes (Pari/GP uses less but it uses fully filled limbs). | 
| Ah, I see. | 
| Speed-test gets reinstated when cutoff-tuning is implemented. | 
| Other limb-sizes: There is also a spot in the code in  There are not a lot of differences here; a fixed cutoff at 500 bits (rounded to the next limb-size) would make the most sense for both: as a general cutoff and as the cutoff between the built-in high short product and full multiplication. As said before: it is already a lot of code for a "simple" radix conversion! | 
| What? Ah,  | 
| Line 29 in  
&& (b->used > (2 * MP_MUL_KARATSUBA_CUTOFF))
Great! *sigh* | 
| It is already a lot of code. Not so much for reading but quite a lot for writing. So I'd like to stop here. No more cutoff branches (only bases 16 and 64 would be of interest anyway), and no more internal branches except for radices of the form  So if there are no complaints, I'd like to give it a wet wipe and call it done. OK? | 
| 
 Absolutely OK :) Those graphs look good. Do I understand correctly that the "slow path" is the old version? | 
| 
 They are the old ones, yes, but with the added string positioning, which shouldn't add much runtime because it is outside of the loops. Yes, my ability to give the children good names is still highly underdeveloped, sorry ;-) | 
    | As always: when you thought you had it all ... ;-) | 
| Hold your horses, please. I just saw that I forgot to add the printing part in  | 
| No, I don't know how that happened (the rotation of  You might try  | 
| Tuning does not work in  | 
| 
 Sure, make a separate PR. Did you check whether the amalgamated version of the library results in different tuning parameters? | 
| 
 No, I was just happy that it works in the first place! ;-) But I'll take a look, of course. 
 Good question. I'll take a look. (I think we would need the full triplet tune->profile->tune) 
 Nearly as much as in the cereals aisle at Walmart ;-) But I found a problem with  
#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 115
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 152
#define MP_DEFAULT_MUL_TOOM_CUTOFF      139
#define MP_DEFAULT_SQR_TOOM_CUTOFF      212
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    218
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    256
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    254
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    256
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF    256
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF    256
#define MP_DEFAULT_MUL_FFT_CUTOFF       608
#define MP_DEFAULT_SQR_FFT_CUTOFF       708
Which is clearly nonsense. (256 limbs is the COMBA size on this machine, btw.) I mean: my implementation of FFT is fast, but I'm pretty sure not that fast ;-) 
#define MP_DEFAULT_MUL_KARATSUBA_CUTOFF 109
#define MP_DEFAULT_SQR_KARATSUBA_CUTOFF 138
#define MP_DEFAULT_MUL_TOOM_CUTOFF      149
#define MP_DEFAULT_SQR_TOOM_CUTOFF      266
#define MP_DEFAULT_MUL_TOOM_4_CUTOFF    880
#define MP_DEFAULT_SQR_TOOM_4_CUTOFF    2952
#define MP_DEFAULT_MUL_TOOM_5_CUTOFF    982
#define MP_DEFAULT_SQR_TOOM_5_CUTOFF    927
#define MP_DEFAULT_MUL_TOOM_6_CUTOFF    1251
#define MP_DEFAULT_SQR_TOOM_6_CUTOFF    1083
/* Results with steps of 100 and (1000) resp. */
#define MP_DEFAULT_MUL_FFT_CUTOFF       24208 (47008)
#define MP_DEFAULT_SQR_FFT_CUTOFF       12308 (43008)
That looks a bit more reasonable. There is not much difference up to  Why is  
(git blame --line-porcelain etc/tune_it.sh; git blame --line-porcelain etc/tune.c) \
   | sed -n 's/^author //p' | sort | uniq -c | sort -rn
Ah, I forgot, never mind ;-) The outlier  And because I know you like a pretty picture or two: Ok, the FFT cutoffs from the beginning were not a lie ;-) TC-7 is in the works (some coefficients are too large even for 32-bit) and TC-8 has a bug in the implementation (the PARI/GP script works, though). With TC-9 and above, the coefficients get larger than 64 bits. | 
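To make the role of these cutoffs concrete, here is a minimal sketch of how a multiplication front end might pick an algorithm from the operand sizes. It is not the library's dispatcher: `toy_cutoff`, `toy_mul_cutoffs`, and `toy_pick_mul` are hypothetical names, the limb counts are simply the second set of tuned values quoted above, and comparing the smaller operand with `>=` is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: smallest operand size (in limbs) at which
 * the named algorithm is preferred. */
typedef struct {
   size_t      min_limbs;
   const char *name;
} toy_cutoff;

/* Values copied from the second tuning run quoted in this comment,
 * ordered from the largest cutoff to the smallest. */
static const toy_cutoff toy_mul_cutoffs[] = {
   { 24208u, "FFT" },
   {  1251u, "Toom-Cook 6" },
   {   982u, "Toom-Cook 5" },
   {   880u, "Toom-Cook 4" },
   {   149u, "Toom-Cook 3" },
   {   109u, "Karatsuba" }
};

/* Pick an algorithm by the size of the smaller operand (assumption). */
static const char *toy_pick_mul(size_t a_used, size_t b_used)
{
   size_t min = (a_used < b_used) ? a_used : b_used;
   size_t i;

   for (i = 0; i < sizeof(toy_mul_cutoffs) / sizeof(toy_mul_cutoffs[0]); i++) {
      if (min >= toy_mul_cutoffs[i].min_limbs) {
         return toy_mul_cutoffs[i].name;
      }
   }
   /* Below every cutoff: plain schoolbook (or Comba) multiplication. */
   return "schoolbook";
}
```

With a table like this, a tuning run that reports the same value for several neighbouring cutoffs (as in the first set of numbers) makes the algorithms in between unreachable, which is one way to spot a broken measurement.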
| 
 Me and my big mouth! ;-) I see that  The rest of the TODOs in this PR are mostly optimizations that need external work (e.g. short products) or small things that are not function related and hence not urgent. Will again shove my mop through it and wrap it up for good now. I know my tendency for "featuritis" too well ;-). | 
This PR includes both variations to implement the Schönhage trick: the standard way and the method proposed by Lars Helmström to compute the reciprocals.
The default is the normal way. Switch to the second method, which uses a round of Newton-Raphson (N-R-method), with
make "CFLAGS=-DMP_TO_RADIX_USE_NEWTON_RAPHSON " test
Found no significant difference in speed, but YMMV, as always.
The cutoffs are all regulated by MP_RADIX_BARRETT_START_MULTIPLICATOR, no finer resolution for now. Default is MP_RADIX_BARRETT_START_MULTIPLICATOR=10. There is not much difference but there is: (Timings are read/write combined)
The timing used in demo/test.c to check if the faster method is actually faster is switched off if it runs in Valgrind.
There are a lot of edge cases, so it is a lot to test. On the upper end with MP_64BIT and base 10, the N-R-method works well up to 2^(2^27), the normal way up to 2^(2^30) (up to MP_MAX_DIGIT_COUNT - 4 in reading, writing is still running).
More information in the comments in the code.
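For readers unfamiliar with the divide-and-conquer idea behind the Schönhage trick, the following toy sketch shows the recursion on a plain `uint64_t` instead of an `mp_int`. Everything in it (`toy_small`, `toy_rec`, `toy_to_decimal`, the `TOY_*` constants) is hypothetical and not taken from this PR. It also uses ordinary `/` and `%`, where a big-number version would divide by the precomputed powers using Barrett reduction with precomputed reciprocals.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define TOY_TOP_LEVEL    5u   /* 2^5 = 32 digits, enough for any uint64_t */
#define TOY_DIGITS       32u
#define TOY_CUTOFF_LEVEL 2u   /* chunks of 2^2 = 4 digits use the slow loop */

/* Schoolbook conversion: writes exactly `width` digits, zero padded. */
static void toy_small(uint64_t n, char *out, size_t width)
{
   size_t i;
   for (i = width; i > 0; i--) {
      out[i - 1] = (char)('0' + (int)(n % 10u));
      n /= 10u;
   }
}

/* Writes exactly 2^level digits of n, assuming n < 10^(2^level).
 * pw[k] holds 10^(2^k). */
static void toy_rec(uint64_t n, unsigned level, const uint64_t *pw, char *out)
{
   unsigned half;

   if (level <= TOY_CUTOFF_LEVEL) {
      toy_small(n, out, (size_t)1 << level);
      return;
   }
   half = level - 1u;
   /* Split into a high and a low half of 2^half digits each. */
   toy_rec(n / pw[half], half, pw, out);
   toy_rec(n % pw[half], half, pw, out + ((size_t)1 << half));
}

/* Converts n to a NUL-terminated decimal string; out needs 21 bytes. */
static void toy_to_decimal(uint64_t n, char *out)
{
   uint64_t pw[TOY_TOP_LEVEL];
   char     tmp[TOY_DIGITS];
   size_t   skip = 0;
   unsigned k;

   /* Precompute 10^(2^k) by repeated squaring: 10, 10^2, 10^4, 10^8, 10^16. */
   pw[0] = 10u;
   for (k = 1; k < TOY_TOP_LEVEL; k++) {
      pw[k] = pw[k - 1] * pw[k - 1];
   }
   toy_rec(n, TOY_TOP_LEVEL, pw, tmp);

   /* Strip the zero padding but keep at least one digit. */
   while ((skip < (TOY_DIGITS - 1u)) && (tmp[skip] == '0')) {
      skip++;
   }
   memcpy(out, tmp + skip, TOY_DIGITS - skip);
   out[TOY_DIGITS - skip] = '\0';
}
```

The cutoff level plays the same role as the cutoffs discussed in this PR: below it, the simple digit-by-digit loop is cheaper than another split.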