feat: use_mixed_precision actually emits fp32 FFT for light profiles
The flag previously forced an fp64 FFT internally in convolved_image_from and
only downcast the result at the end -- a net loss on consumer GPUs (the
mixed-precision full pipeline ran 27% slower than plain fp64 on an RTX 2060).
The light-profile FFT path now runs end-to-end in complex64, with the kernel
transform pre-cached on ConvolverState.fft_kernel_c64 to keep the per-call
astype out of CPU profiles.
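A minimal NumPy sketch of the dtype flow described above. The helper name
precompute_fft_kernel_c64 and the padding/centering details are hypothetical,
not the actual implementation; note that NumPy's FFT computes in double
internally and always returns complex128, so the astype calls here only model
the precision path that a GPU backend (cuFFT / JAX) executes natively in
single precision:

```python
import numpy as np

def precompute_fft_kernel_c64(psf, image_shape):
    # Hypothetical analogue of ConvolverState.fft_kernel_c64: pad the PSF to
    # the image shape once, transform it, and cache the result as complex64
    # so no per-call astype of the kernel is needed.
    pad = np.zeros(image_shape, dtype=np.float32)
    kh, kw = psf.shape
    pad[:kh, :kw] = psf.astype(np.float32)
    # roll the kernel centre to (0, 0) so the frequency-domain multiply
    # applies no spatial shift
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.rfft2(pad).astype(np.complex64)

def convolve_c64(image, fft_kernel_c64):
    # End-to-end complex64 path: fp32 input, c64 multiply, fp32 output.
    # (NumPy upcasts internally; a single-precision FFT backend would not.)
    f = np.fft.rfft2(image.astype(np.float32)).astype(np.complex64)
    out = np.fft.irfft2(f * fft_kernel_c64, s=image.shape)
    return out.astype(np.float32)
```

With a delta-function PSF the circular convolution returns the input image
unchanged, which makes a convenient sanity check for the dtype plumbing.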
convolved_mapping_matrix_from intentionally keeps the fp64 kernel multiply
to preserve pixelization figure_of_merit precision: full fp32 in that path
caused 1.9% relative drift on the autolens_workspace_test delaunay_mge
regression (K=780 source mesh). The fp32 input cube and forward rfft2 are
kept for the cheaper scatter and FFT, but the multiply upcasts back to
complex128 and the irfft2 returns fp64.
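A sketch of the hybrid precision path kept for the mapping-matrix case. The
function name and array layout are assumptions for illustration: the source
cube stays fp32 through the scatter and forward rfft2, but the kernel multiply
upcasts to complex128 and the irfft2 hands back fp64 (again, NumPy computes in
double regardless; the astype calls model the intended dtype flow):

```python
import numpy as np

def convolve_mapping_mixed(cube_f32, psf):
    # cube_f32: (K, H, W) stack of fp32 source-pixel images (hypothetical
    # layout). Forward transform in single precision for speed; multiply
    # and inverse in double to protect figure_of_merit precision.
    shape = cube_f32.shape[-2:]
    pad = np.zeros(shape, dtype=np.float64)
    kh, kw = psf.shape
    pad[:kh, :kw] = psf
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    kernel_c128 = np.fft.rfft2(pad)  # fp64 kernel transform, complex128

    # cheap fp32 forward FFT of the cube
    f_c64 = np.fft.rfft2(cube_f32.astype(np.float32),
                         axes=(-2, -1)).astype(np.complex64)
    # upcast only at the multiply; irfft2 then returns float64
    out = np.fft.irfft2(f_c64.astype(np.complex128) * kernel_c128,
                        s=shape, axes=(-2, -1))
    return out  # fp64
```

The design point is that the multiply, not the forward transform, is where
fp32 rounding accumulated into the 1.9% figure_of_merit drift, so only that
step and everything after it pay the double-precision cost.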
Empirical impact (RTX 2060 + i9-10885H, mge.py HST-shaped regression):
- GPU mp full pipeline: 47 ms -> 19.6 ms
- GPU mp vmap (production hot path): 18 ms -> 8.9 ms (49% faster)
- CPU vmap: ~unchanged
- Delta log-likelihood: 2.2e-3 absolute, far below the chi^2 noise floor
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>