From 18546ea4fdfc7715ebf0988bdcbe6862e10d3e3e Mon Sep 17 00:00:00 2001
From: Jammy2211
Date: Sun, 24 May 2026 18:17:57 +0100
Subject: [PATCH] =?UTF-8?q?likelihood=5Fruntime:=20A100=20imaging=20suite?=
 =?UTF-8?q?=20(4=20instruments=20=C3=97=203=20cells)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds per-instrument A100 runtime sweep across the 4 imaging instrument
presets (hst, jwst, ao, euclid) and 3 imaging cells (delaunay, mge,
pixelization) at fp64 + mp: 24 cells total, all green on A100.

OPTIMIZATION_NOTES.md gains a "Per-instrument A100 runtime sweep
(2026-05-24)" subsection inside each of the 3 imaging cell sections,
plus a new top-level `imaging/mge` section (previously missing; only
the breakdown variant had been documented).

Key findings:

- imaging/mge is the fastest cell in the entire imaging sweep at
  ~6 ms/call across all 4 instruments. Pure analytical light + mass +
  Gaussian basis convolution: no mesh construction, no sparse
  operator, no large FFT.
- imaging/pixelization is essentially instrument-INDEPENDENT at
  ~53 ms/call (<2% spread across hst/jwst/ao/euclid at fp64). The
  35×35 rectangular source mesh dominates the per-call FFT budget;
  data-side mask shape barely matters. This is the inverse of the
  interferometer result, where mask-FFT extent drove a 6× per-call
  spread.
- imaging/delaunay is the slowest cell at 77-86 ms/call at fp64
  (~12% spread). Hilbert-mesh + triangulation cost dominates.

mp is a small consistent win (~4-5%) on imaging/pixelization across
all 4 instruments, and a wash on imaging/delaunay and imaging/mge.

Status table flips HPC A100 imaging suite from missing to ✅ run.

Co-Authored-By: Claude Opus 4.7
---
 likelihood_runtime/OPTIMIZATION_NOTES.md | 65 ++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/likelihood_runtime/OPTIMIZATION_NOTES.md b/likelihood_runtime/OPTIMIZATION_NOTES.md
index 38fb49f..ce39aba 100644
--- a/likelihood_runtime/OPTIMIZATION_NOTES.md
+++ b/likelihood_runtime/OPTIMIZATION_NOTES.md
@@ -24,6 +24,7 @@ follow-up (see the bottom of this doc).
 | HPC A100 alma (4 cells) | ✅ run 2026-05-22 — unblocked by PyAutoArray#329 (apply_sparse_operator now accepts TransformerNUFFT) |
 | HPC A100 alma_high (4 cells) | ✅ run 2026-05-22 — unblocked by PyAutoArray#330 (TransformerNUFFT chunk_size knob caps the nufftax gather buffer) |
 | HPC A100 jvla (2 cells, stretch) | ✅ run 2026-05-24 — interferometer/delaunay only (25M vis, 700-px mask, pixel_scale=0.01). No fix needed; chunked NUFFT + W-Tilde sparse path held. |
+| HPC A100 imaging suite (24 cells) | ✅ run 2026-05-24 — {delaunay, mge, pixelization} × {hst, jwst, ao, euclid} × {fp64, mp}. Per-instrument runtime tables in each cell's section below. |
 | Imaging cells fresh CPU/GPU | ⚠ blocked by upstream `Grid2DIrregular.mask` bug — table rows show the pre-existing v2026.5.8.2 / v2026.5.14.2 data |
 
 ## Headline numbers (full pipeline, single JIT per call)
@@ -138,6 +139,31 @@ PyAutoLens v2026.5.14.2. The pre-existing v2026.5.8.2 sweep data in
 **mp verdict** — modest on CPU (~14 % win), neutral elsewhere. **Useful only
 at CPU scale**; skip on GPU.
 
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+Full-pipeline single-JIT cost per likelihood call across the 4 imaging
+instrument presets. Same model, same rectangular mesh, same regularization
+— only the dataset's pixel_scale (and hence mask shape) changes.
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst | 0.05 | 140 × 140 | 53 ms | 51 ms |
+| jwst | 0.03 | 234 × 234 | 53 ms | 51 ms |
+| ao | 0.01 | 700 × 700 | 53 ms | 51 ms |
+| euclid | 0.1 | 70 × 70 | 54 ms | 51 ms |
+
+**Imaging/pixelization is essentially instrument-INDEPENDENT.** The
+35×35 rectangular source mesh (1225 nodes) dominates the per-call FFT
+budget; the data-side mask shape barely matters. Going euclid → ao
+the data grid scales 100× in pixel count, but per-call time changes
+<2%. This is exactly the inverse of the interferometer/pixelization
+result, where mask-FFT extent drove a 6× per-call spread.
+
+**mp is uniformly a small win** (~4-5%) across all 4 imaging
+instruments on this cell, with no scaling story — fixed-size FFTs
+amortize the mixed-precision overhead the same way regardless of
+instrument.
+
 ---
 
 ## imaging/delaunay
@@ -176,6 +202,45 @@ CPU rows have no fresh measurement available.
 **mp verdict** — barely measurable (~5 % on GPU). Skip; it's not worth the
 correctness-budget pressure.
 
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst | 0.05 | 140 × 140 | 86 ms | 90 ms |
+| jwst | 0.03 | 234 × 234 | 80 ms | 78 ms |
+| ao | 0.01 | 700 × 700 | 85 ms | 80 ms |
+| euclid | 0.1 | 70 × 70 | 77 ms | 78 ms |
+
+**Delaunay is also nearly instrument-independent on A100** (77-86 ms
+range at fp64, ~12% variation). Hilbert-mesh + triangulation cost dominates;
+the source-plane mesh has ~1000 nodes regardless of instrument.
+**mp is a wash** on this cell across all 4 instruments.
+
+---
+
+## imaging/mge
+
+*MGE-decomposed source (Gaussian basis, ~25 Gaussians). Isothermal +
+ExternalShear lens. Lowest per-call cost in the imaging suite.*
+
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst | 0.05 | 140 × 140 | 5.8 ms | 6.4 ms |
+| jwst | 0.03 | 234 × 234 | 6.0 ms | 6.0 ms |
+| ao | 0.01 | 700 × 700 | 6.0 ms | 5.8 ms |
+| euclid | 0.1 | 70 × 70 | 5.9 ms | 5.8 ms |
+
+**Fastest cell in the entire imaging-side sweep at ~6 ms / call.**
+Analytical light + parametric mass + Gaussian basis convolution — no
+mesh construction, no sparse-operator setup, no large FFT. Per-call
+cost is essentially constant across all 4 instruments (the data grid
+shape barely registers).
+
+**mp is a wash** at this scale — the kernel is too small for
+mixed-precision matmul gains to surface.
+
 ---
 
 ## interferometer/mge