
Conversation

SwayamInSync
Member

This PR adds a patch to make smallest_subnormal have the same value on all platforms regardless of endianness, and fixes #140.

@SwayamInSync SwayamInSync marked this pull request as draft September 4, 2025 13:37
@ngoldbaum
Member

I think you might have found a thread safety issue in the Dragon4 float128 printing. Here's the traceback for the segfault if I remove the mutex you added:

* thread #42, stop reason = EXC_BAD_ACCESS (code=2, address=0x100e98000)
    frame #0: 0x0000000100e6dec0 _quaddtype_main.cpython-313t-darwin.so`Dragon4 [inlined] BigInt_Multiply(result=0x0000000100e91dd8, lhs=0x0000000100e92dd8, rhs=<unavailable>) at dragon4.c:534:24 [opt]
   531 	            } while (largeCur != large->blocks + large->length);
   532
   533 	            DEBUG_ASSERT(resultCur < result->blocks + maxResultLen);
-> 534 	            *resultCur = (npy_uint32)(carry & bitmask_u64(32));
   535 	        }
   536 	    }
   537
Target 0: (python) stopped.
warning: _quaddtype_main.cpython-313t-darwin.so was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #42, stop reason = EXC_BAD_ACCESS (code=2, address=0x100e98000)
  * frame #0: 0x0000000100e6dec0 _quaddtype_main.cpython-313t-darwin.so`Dragon4 [inlined] BigInt_Multiply(result=0x0000000100e91dd8, lhs=0x0000000100e92dd8, rhs=<unavailable>) at dragon4.c:534:24 [opt]
    frame #1: 0x0000000100e6de2c _quaddtype_main.cpython-313t-darwin.so`Dragon4 [inlined] BigInt_Pow10(result=<unavailable>, exponent=1, temp=<unavailable>) at dragon4.c:633:13 [opt]
    frame #2: 0x0000000100e6ddb4 _quaddtype_main.cpython-313t-darwin.so`Dragon4(bigints=0x0000000100e8cdd8, exponent=<unavailable>, mantissaBit=<unavailable>, hasUnequalMargins='\0', digitMode=<unavailable>, cutoffMode=<unavailable>, cutoff_max=<unavailable>, cutoff_min=<unavailable>, pOutBuffer=<unavailable>, bufferSize=16384, pOutExponent=0x0000000172e1eb60) at dragon4.c:1169:9 [opt]
    frame #3: 0x0000000100e6d218 _quaddtype_main.cpython-313t-darwin.so`Dragon4_PrintFloat_Sleef_quad [inlined] FormatScientific(buffer=<unavailable>, bufferSize=16384, mantissa=<unavailable>, exponent=<unavailable>, signbit=<unavailable>, mantissaBit=0, hasUnequalMargins='\0', digit_mode=DigitMode_Unique, precision=33, min_digits=0, trim_mode=TrimMode_LeaveOneZero, digits_left=<unavailable>, exp_digits=3) at dragon4.c:1694:17 [opt]
    frame #4: 0x0000000100e6d1cc _quaddtype_main.cpython-313t-darwin.so`Dragon4_PrintFloat_Sleef_quad [inlined] Format_floatbits(buffer=<unavailable>, bufferSize=16384, mantissa=<unavailable>, exponent=<unavailable>, signbit=<unavailable>, mantissaBit=0, hasUnequalMargins='\0', opt=<unavailable>) at dragon4.c:1840:16 [opt]
    frame #5: 0x0000000100e6d12c _quaddtype_main.cpython-313t-darwin.so`Dragon4_PrintFloat_Sleef_quad(value=<unavailable>, opt=<unavailable>) at dragon4.c:1914:12 [opt]
    frame #6: 0x0000000100e6ee84 _quaddtype_main.cpython-313t-darwin.so`Dragon4_Scientific_QuadDType [inlined] Dragon4_Scientific_QuadDType_opt(val=<unavailable>, opt=0x0000000172e1ebe8) at dragon4.c:1954:9 [opt]
    frame #7: 0x0000000100e6ee7c _quaddtype_main.cpython-313t-darwin.so`Dragon4_Scientific_QuadDType(val=<unavailable>, digit_mode=DigitMode_Unique, precision=<unavailable>, min_digits=<unavailable>, sign=<unavailable>, trim=<unavailable>, pad_left=<unavailable>, exp_digits=<unavailable>) at dragon4.c:1978:12 [opt]
    frame #8: 0x0000000100e6a764 _quaddtype_main.cpython-313t-darwin.so`QuadPrecision_repr_dragon4(self=0x000002401cc9e2c0) at scalar.c:0 [opt]
    frame #9: 0x000000010077b044 libpython3.13t.dylib`PyObject_Repr + 108
    frame #10: 0x0000000100775ec8 libpython3.13t.dylib`cfunction_vectorcall_O + 408
    frame #11: 0x00000001007116f8 libpython3.13t.dylib`PyObject_Vectorcall + 88
    frame #12: 0x0000000100855c0c libpython3.13t.dylib`_PyEval_EvalFrameDefault + 36992
    frame #13: 0x0000000100714560 libpython3.13t.dylib`method_vectorcall + 316
    frame #14: 0x0000000100940230 libpython3.13t.dylib`thread_run + 128
    frame #15: 0x00000001008d64d8 libpython3.13t.dylib`pythread_wrapper + 28
    frame #16: 0x000000019c4f3c0c libsystem_pthread.dylib`_pthread_start + 136

I'm going to try this again with a TSan build of Python to see where the first data race happens...

@ngoldbaum
Member

Here's the race:

WARNING: ThreadSanitizer: data race (pid=15439)
  Write of size 4 at 0x000164124ea0 by thread T391:
    #0 Dragon4_PrintFloat_Sleef_quad dragon4.c:1913 (_quaddtype_main.cpython-314t-darwin.so:arm64+0x100dc)
    #1 Dragon4_Scientific_QuadDType dragon4.c:1978 (_quaddtype_main.cpython-314t-darwin.so:arm64+0x122a4)
    #2 QuadPrecision_repr_dragon4 scalar.c (_quaddtype_main.cpython-314t-darwin.so:arm64+0xb288)
    #3 PyObject_Repr object.c:779 (libpython3.14t.dylib:arm64+0x1356bc)
    #4 builtin_repr bltinmodule.c:2571 (libpython3.14t.dylib:arm64+0x276b50)
    #5 cfunction_vectorcall_O methodobject.c:536 (libpython3.14t.dylib:arm64+0x12d044)
    #6 PyObject_Vectorcall call.c:327 (libpython3.14t.dylib:arm64+0x8aea8)
    #7 _PyEval_EvalFrameDefault generated_cases.c.h:1619 (libpython3.14t.dylib:arm64+0x27f504)
    #8 _PyEval_Vector ceval.c:1965 (libpython3.14t.dylib:arm64+0x27b278)
    #9 _PyFunction_Vectorcall call.c (libpython3.14t.dylib:arm64+0x8b4fc)
    #10 method_vectorcall classobject.c:73 (libpython3.14t.dylib:arm64+0x8fa34)
    #11 context_run context.c:728 (libpython3.14t.dylib:arm64+0x2c9790)
    #12 _PyEval_EvalFrameDefault generated_cases.c.h:3744 (libpython3.14t.dylib:arm64+0x2859c4)
    #13 _PyEval_Vector ceval.c:1965 (libpython3.14t.dylib:arm64+0x27b278)
    #14 _PyFunction_Vectorcall call.c (libpython3.14t.dylib:arm64+0x8b4fc)
    #15 method_vectorcall classobject.c:73 (libpython3.14t.dylib:arm64+0x8fa34)
    #16 _PyObject_Call call.c:348 (libpython3.14t.dylib:arm64+0x8b160)
    #17 PyObject_Call call.c:373 (libpython3.14t.dylib:arm64+0x8b1d8)
    #18 thread_run _threadmodule.c:359 (libpython3.14t.dylib:arm64+0x41e014)
    #19 pythread_wrapper thread_pthread.h:242 (libpython3.14t.dylib:arm64+0x36e878)

  Previous write of size 4 at 0x000164124ea0 by thread T392:
    #0 Dragon4_PrintFloat_Sleef_quad dragon4.c:1913 (_quaddtype_main.cpython-314t-darwin.so:arm64+0x100dc)
    #1 Dragon4_Scientific_QuadDType dragon4.c:1978 (_quaddtype_main.cpython-314t-darwin.so:arm64+0x122a4)
    #2 QuadPrecision_repr_dragon4 scalar.c (_quaddtype_main.cpython-314t-darwin.so:arm64+0xb288)
    #3 PyObject_Repr object.c:779 (libpython3.14t.dylib:arm64+0x1356bc)
    #4 builtin_repr bltinmodule.c:2571 (libpython3.14t.dylib:arm64+0x276b50)
    #5 cfunction_vectorcall_O methodobject.c:536 (libpython3.14t.dylib:arm64+0x12d044)
    #6 PyObject_Vectorcall call.c:327 (libpython3.14t.dylib:arm64+0x8aea8)
    #7 _PyEval_EvalFrameDefault generated_cases.c.h:1619 (libpython3.14t.dylib:arm64+0x27f504)
    #8 _PyEval_Vector ceval.c:1965 (libpython3.14t.dylib:arm64+0x27b278)
    #9 _PyFunction_Vectorcall call.c (libpython3.14t.dylib:arm64+0x8b4fc)
    #10 method_vectorcall classobject.c:73 (libpython3.14t.dylib:arm64+0x8fa34)
    #11 context_run context.c:728 (libpython3.14t.dylib:arm64+0x2c9790)
    #12 _PyEval_EvalFrameDefault generated_cases.c.h:3744 (libpython3.14t.dylib:arm64+0x2859c4)
    #13 _PyEval_Vector ceval.c:1965 (libpython3.14t.dylib:arm64+0x27b278)
    #14 _PyFunction_Vectorcall call.c (libpython3.14t.dylib:arm64+0x8b4fc)
    #15 method_vectorcall classobject.c:73 (libpython3.14t.dylib:arm64+0x8fa34)
    #16 _PyObject_Call call.c:348 (libpython3.14t.dylib:arm64+0x8b160)
    #17 PyObject_Call call.c:373 (libpython3.14t.dylib:arm64+0x8b1d8)
    #18 thread_run _threadmodule.c:359 (libpython3.14t.dylib:arm64+0x41e014)
    #19 pythread_wrapper thread_pthread.h:242 (libpython3.14t.dylib:arm64+0x36e878)

  Location is global '_bigint_static' at 0x000164124ea0 (_quaddtype_main.cpython-314t-darwin.so+0x40ea0)

And indeed _bigint_static seems like a likely name for a global variable!

It looks like it is declared with NPY_TLS, but I guess that isn't actually expanding to the correct incantation to make this variable thread-local. We might have to do something more complicated.

IMO adding a global mutex is not the right fix; we should instead fix the underlying issue of the thread-local variable not actually being thread-local.
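For context, here is a minimal sketch of what is at stake (the BigInt layout and array size are hypothetical stand-ins mirroring the TSan report, not the exact dragon4.c source): with a real TLS keyword each thread gets a private scratch buffer, and with an empty NPY_TLS the same declaration silently becomes one shared global.

```c
#include <stdint.h>

/* NPY_TLS really comes from numpy/npy_common.h; stubbed here for the sketch. */
#ifndef NPY_TLS
#define NPY_TLS _Thread_local  /* what it *should* expand to with TLS support */
#endif

/* Stand-in for dragon4.c's big-integer scratch type (hypothetical layout). */
typedef struct {
    int32_t length;
    uint32_t blocks[1 << 10];
} BigInt;

/* With a real TLS keyword each thread gets a private copy; if NPY_TLS
 * expands to nothing, this becomes one shared global, and concurrent
 * Dragon4 calls race on it -- exactly what TSan reported. */
static NPY_TLS BigInt _bigint_static;
```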

@ngoldbaum
Member

BTW, if you want to experiment with using TSan or want to set up TSan CI, we have docker images you can use: https://github.com/nascheme/cpython_sanity

@ngoldbaum
Member

And more docs on using TSan with Python here: https://py-free-threading.github.io/thread_sanitizer/

@ngoldbaum
Member

ngoldbaum commented Sep 4, 2025

I guess in order to actually use NPY_TLS in C, you also need these HAVE_ variables defined:

https://github.com/numpy/numpy/blob/908e468aff6e6ec00c1f4678dae428ee98a2291a/numpy/_core/include/numpy/npy_common.h#L128-L140
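Paraphrasing those lines (so this is roughly, not exactly, what npy_common.h does): NPY_TLS picks the first available TLS keyword and falls back to nothing when none of the HAVE_ macros is defined.

```c
/* Rough paraphrase of the linked npy_common.h block: */
#if defined(HAVE_THREAD_LOCAL)
    #define NPY_TLS thread_local
#elif defined(HAVE__THREAD_LOCAL)
    #define NPY_TLS _Thread_local
#elif defined(HAVE___THREAD)
    #define NPY_TLS __thread
#elif defined(HAVE___DECLSPEC_THREAD_)
    #define NPY_TLS __declspec(thread)
#else
    /* No keyword detected: NPY_TLS silently becomes an ordinary global. */
    #define NPY_TLS
#endif
```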

which require some compile checks in the meson configuration. Here's where that happens in NumPy's meson build:

https://github.com/numpy/numpy/blob/908e468aff6e6ec00c1f4678dae428ee98a2291a/numpy/_core/meson.build#L268-L292
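The gist of that meson logic, sketched (simplified and hypothetical; NumPy's actual build writes the result into its config header rather than passing -D flags):

```meson
# Probe each TLS keyword with a compile check and define the matching
# HAVE_ macro for the first one that works.
cc = meson.get_compiler('c')
foreach kw : ['thread_local', '_Thread_local', '__thread', '__declspec(thread)']
  if cc.compiles('@0@ int x = 42; int main(void) { return x; }'.format(kw),
                 name: 'TLS keyword ' + kw)
    # e.g. '_Thread_local' -> 'HAVE__THREAD_LOCAL'
    add_project_arguments('-DHAVE_' + kw.underscorify().to_upper(), language: 'c')
    break
  endif
endforeach
```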

Sorry for the trouble, I didn't realize there was this wrinkle around using NPY_TLS until now.

@SwayamInSync
Member Author

SwayamInSync commented Sep 4, 2025

> I guess in order to actually use NPY_TLS in C, you also need these HAVE_ variables defined:
> numpy/numpy@908e468/numpy/_core/include/numpy/npy_common.h#L128-L140

This makes sense; I was thinking my code editor was just being lazy when I couldn't jump to the definition of NPY_TLS.
But in that case NPY_TLS should already be defined, since we install NumPy and use its public includes, and it probably expands to nothing on CI (making it a normal global variable).

Okay, I can try changing the logic of dragon4.c to work without defining a global TLS variable.

@ngoldbaum
Member

> Okay, I can try changing the logic of dragon4.c to work without defining a global TLS variable.

No, please don't do that. We should just fix the meson configuration so it defines the thread-local annotation correctly.

@ngoldbaum
Member

I think we require C11, don't we? Maybe you can just use _Thread_local unconditionally?
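That is, reusing the BigInt sketch from above, the declaration could be spelled directly:

```c
/* C11 guarantees the _Thread_local keyword (MSVC aside, where
 * __declspec(thread) applies), so no feature probe would be needed: */
static _Thread_local BigInt _bigint_static;
```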

@SwayamInSync
Member Author

> I think we require C11, don't we? Maybe you can just use _Thread_local unconditionally?

I made the changes in meson and it works as expected locally, detecting the keyword and setting up the macro. Let's see how CI goes.

@SwayamInSync
Member Author

../numpy_quaddtype/src/dragon4.c:34:2: warning: "NPY_TLS Thread-local storage support detected." [-W#warnings]
 #warning "NPY_TLS Thread-local storage support detected."

Okay, so the macro seems to be expanding correctly, but the issue remains (after removing the mutex).
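For reference, a guard along these lines (a sketch, assuming the HAVE_ macros from npy_common.h) could be what was added at dragon4.c:34 to produce that compile-time confirmation:

```c
/* Temporary diagnostic: fail the build if no TLS keyword was detected,
 * otherwise emit the warning seen in the build log above.
 * (#warning is a GCC/Clang extension, standardized only in C23.) */
#if !defined(HAVE_THREAD_LOCAL) && !defined(HAVE__THREAD_LOCAL) && \
    !defined(HAVE___THREAD) && !defined(HAVE___DECLSPEC_THREAD_)
    #error "NPY_TLS would expand to nothing: no thread-local keyword found."
#else
    #warning "NPY_TLS Thread-local storage support detected."
#endif
```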

@ngoldbaum
Member

Nice, everything seems to be working with this latest push. I'd say go ahead and merge this. I would also do a bugfix release for this.

@SwayamInSync
Member Author

> I would also do a bugfix release for this.

Yeah, about that: we need to tweak meson or add things explicitly in the toml file. In the previous sdist the submodules, LICENSE, and sleef were not packaged, and this is causing some issues with the conda-forge integration.

I checked the meson docs, and we can declare the extra things we want to include in the toml file. So we can fix this part as well and then make the release.
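For instance, meson-python lets you forward extra arguments to `meson dist` from pyproject.toml (a sketch based on its documented options; whether this alone pulls in the git submodules and the SLEEF sources would need checking):

```toml
# pyproject.toml: pass extra flags through meson-python to `meson dist`
[tool.meson-python.args]
dist = ['--include-subprojects']
```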

@ngoldbaum
Member

Makes sense. I hope this has been a fun learning experience in Python Packaging for you 😀

@SwayamInSync SwayamInSync marked this pull request as ready for review September 4, 2025 19:24
@SwayamInSync
Member Author

I think it's ready to merge now @ngoldbaum

@ngoldbaum
Member

OK, cool, merging. I'm not sure if you all decided whether this is a bug in SLEEF. If you think it is, it would also be awesome if one of you could report a bug to the upstream SLEEF project.

@ngoldbaum ngoldbaum merged commit 380fb83 into numpy:main Sep 4, 2025
7 checks passed
@SwayamInSync
Member Author

> OK, cool, merging. I'm not sure if you all decided whether this is a bug in SLEEF. If you think it is, it would also be awesome if one of you could report a bug to the upstream SLEEF project.

At least in 3.8, the way they define SLEEF_QUAD_DENORM_MIN is incorrect.
I managed to get the exact value via sleef_q(+0x0000000000000LL, 0x0000000000000001ULL, -16383), so that seems to be the correct definition.

To check the current latest version I'd need to compile it and then verify whether it is the same or already fixed; SLEEF's code on GitHub is pretty unreadable since "everything dispatches" at build time.
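For anyone who wants to verify a given SLEEF build, a bytewise comparison avoids relying on quad printing entirely (a sketch; it assumes sleefquad.h is available and that Sleef_quad carries no padding bits):

```c
#include <stdio.h>
#include <string.h>
#include <sleefquad.h>

int main(void) {
    Sleef_quad from_header = SLEEF_QUAD_DENORM_MIN;
    /* sleef_q(hi, lo, exp) builds a quad from the mantissa halves and the
     * unbiased exponent; the smallest subnormal has mantissa 1, exp -16383. */
    Sleef_quad from_bits = sleef_q(+0x0000000000000LL, 0x0000000000000001ULL, -16383);
    printf("SLEEF_QUAD_DENORM_MIN matches the expected bit pattern: %s\n",
           memcmp(&from_header, &from_bits, sizeof(Sleef_quad)) == 0 ? "yes" : "no");
    return 0;
}
```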
