From 18546ea4fdfc7715ebf0988bdcbe6862e10d3e3e Mon Sep 17 00:00:00 2001
From: Jammy2211 <JNightingale2211@gmail.com>
Date: Sun, 24 May 2026 18:17:57 +0100
Subject: [PATCH] =?UTF-8?q?likelihood=5Fruntime:=20A100=20imaging=20suite?=
 =?UTF-8?q?=20(4=20instruments=20=C3=97=203=20cells)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a per-instrument A100 runtime sweep across the 4 imaging instrument
presets (hst, jwst, ao, euclid) and 3 imaging cells (delaunay, mge,
pixelization) at fp64 + mp: 24 cells total, all green on A100.

OPTIMIZATION_NOTES.md gains a "Per-instrument A100 runtime sweep
(2026-05-24)" subsection inside each of the 3 imaging cell sections,
plus a new top-level `imaging/mge` section (previously missing — only
the breakdown variant had been documented).

Key findings:

- imaging/mge is the fastest cell in the entire imaging sweep
  at ~6 ms/call across all 4 instruments. Pure analytical light + mass
  + Gaussian basis convolution — no mesh construction, no sparse
  operator, no large FFT.

- imaging/pixelization is essentially instrument-INDEPENDENT
  at ~53 ms/call (<2% spread across hst/jwst/ao/euclid). The 35×35
  rectangular source mesh dominates the per-call FFT budget; data-side
  mask shape barely matters. This is the opposite of the interferometer
  result, where mask-FFT extent drove a 6× per-call spread.

- imaging/delaunay is the slowest cell at 77-86 ms/call (~12%
  spread). Hilbert-mesh + triangulation cost dominates.

mp is a small consistent win (~4-5%) on imaging/pixelization across
all 4 instruments, and a wash on imaging/delaunay and imaging/mge.

Status table flips HPC A100 imaging suite from missing to ✅ run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 likelihood_runtime/OPTIMIZATION_NOTES.md | 65 ++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/likelihood_runtime/OPTIMIZATION_NOTES.md b/likelihood_runtime/OPTIMIZATION_NOTES.md
index 38fb49f..ce39aba 100644
--- a/likelihood_runtime/OPTIMIZATION_NOTES.md
+++ b/likelihood_runtime/OPTIMIZATION_NOTES.md
@@ -24,6 +24,7 @@ follow-up (see the bottom of this doc).
 | HPC A100 alma (4 cells)          | ✅ run 2026-05-22 — unblocked by PyAutoArray#329 (apply_sparse_operator now accepts TransformerNUFFT) |
 | HPC A100 alma_high (4 cells)     | ✅ run 2026-05-22 — unblocked by PyAutoArray#330 (TransformerNUFFT chunk_size knob caps the nufftax gather buffer) |
 | HPC A100 jvla (2 cells, stretch) | ✅ run 2026-05-24 — interferometer/delaunay only (25M vis, 700-px mask, pixel_scale=0.01). No fix needed; chunked NUFFT + W-Tilde sparse path held. |
+| HPC A100 imaging suite (24 cells) | ✅ run 2026-05-24 — {delaunay, mge, pixelization} × {hst, jwst, ao, euclid} × {fp64, mp}. Per-instrument runtime tables in each cell's section below. |
 | Imaging cells fresh CPU/GPU      | ⚠ blocked by upstream `Grid2DIrregular.mask` bug — table rows show the pre-existing v2026.5.8.2 / v2026.5.14.2 data |
 
 ## Headline numbers (full pipeline, single JIT per call)
@@ -138,6 +139,31 @@ PyAutoLens v2026.5.14.2. The pre-existing v2026.5.8.2 sweep data in
 **mp verdict** — modest on CPU (~14 % win), neutral elsewhere.
 **Useful only at CPU scale**; skip on GPU.
 
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+Full-pipeline single-JIT cost per likelihood call across the 4 imaging
+instrument presets. Same model, same rectangular mesh, same regularization
+— only the dataset's pixel_scale (and hence mask shape) changes.
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst        | 0.05        | 140 × 140        | 53 ms | 51 ms |
+| jwst       | 0.03        | 234 × 234        | 53 ms | 51 ms |
+| ao         | 0.01        | 700 × 700        | 53 ms | 51 ms |
+| euclid     | 0.1         | 70 × 70          | 54 ms | 51 ms |
+
+**Imaging/pixelization is essentially instrument-INDEPENDENT.** The
+35×35 rectangular source mesh (1225 nodes) dominates the per-call FFT
+budget; the data-side mask shape barely matters. Going euclid → ao,
+the data grid grows 100× in pixel count, but per-call time changes
+by <2%. This is the opposite of the interferometer/pixelization
+result, where mask-FFT extent drove a 6× per-call spread.
+
+**mp is uniformly a small win** (~4-5%) across all 4 imaging
+instruments on this cell, with no dependence on data-grid size:
+fixed-size FFTs amortize the mixed-precision overhead the same way
+regardless of instrument.
+
 ---
 
 ## imaging/delaunay
@@ -176,6 +202,45 @@ CPU rows have no fresh measurement available.
 **mp verdict** — barely measurable (~5 % on GPU). Skip; it's not worth the
 correctness-budget pressure.
 
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst        | 0.05        | 140 × 140        | 86 ms | 90 ms |
+| jwst       | 0.03        | 234 × 234        | 80 ms | 78 ms |
+| ao         | 0.01        | 700 × 700        | 85 ms | 80 ms |
+| euclid     | 0.1         | 70 × 70          | 77 ms | 78 ms |
+
+**Delaunay is also nearly instrument-independent on A100** (77-86 ms
+fp64 range, ~12% variation). Hilbert-mesh + triangulation cost dominates;
+the source-plane mesh has ~1000 nodes regardless of instrument.
+**mp is a wash** on this cell across all 4 instruments.
+
+---
+
+## imaging/mge
+
+*MGE-decomposed source (Gaussian basis, ~25 Gaussians). Isothermal +
+ExternalShear lens. Lowest per-call cost in the imaging suite.*
+
+### Per-instrument A100 runtime sweep (2026-05-24)
+
+| Instrument | pixel_scale | mask shape (px) | fp64 | mp |
+|------------|-------------|------------------|------|-----|
+| hst        | 0.05        | 140 × 140        | 5.8 ms | 6.4 ms |
+| jwst       | 0.03        | 234 × 234        | 6.0 ms | 6.0 ms |
+| ao         | 0.01        | 700 × 700        | 6.0 ms | 5.8 ms |
+| euclid     | 0.1         | 70 × 70          | 5.9 ms | 5.8 ms |
+
+**Fastest cell in the entire imaging-side sweep at ~6 ms / call.**
+Analytical light + parametric mass + Gaussian basis convolution — no
+mesh construction, no sparse-operator setup, no large FFT. Per-call
+cost is essentially constant across all 4 instruments (the data grid
+shape barely registers).
+
+**mp is a wash** at this scale — the kernel is too small for
+mixed-precision matmul gains to surface.
+
 ---
 
 ## interferometer/mge