feat: CSV-first model API for cluster scripts #187

@Jammy2211

Overview

Make CSV the first-class API for cluster lens modeling end to end. Today only the scaling-tier members live in a CSV (scaling_galaxies.csv); main galaxies, the host halo, and source profiles are still composed inline in Python. We'll extend the CSV layer to cover every cluster component, add a new scripts/cluster/csv_api.py guide that round-trips a full cluster model through CSVs, and refactor the existing 3 cluster scripts so the canonical workflow is "edit a CSV, re-run". The CSV layer becomes the definition of a cluster lens model.

Plan

  • Library helpers (PyAutoGalaxy): extend autogalaxy/galaxy/galaxy_table.py (or sibling module) with readers/writers for named-galaxy CSVs keyed by a galaxy column and a profile_class column. One CSV per profile family — mass, light, point. The reader returns a typed result the workspace can feed into af.Collection(...); the writer takes a list of (galaxy_name, profile) and emits the family CSV. Profile-class lookup goes through al.mp.* / al.lp.* / al.ps.* via getattr. Sparse columns supported (different profile classes use different parameter columns; empty cells are tolerated).
  • scripts/cluster/csv_api.py — new canonical workspace guide. Builds a small cluster model end-to-end in plain Python, writes it out to the family of CSVs, loads it back from those CSVs, prints the round-trip diff so every column has a visible Python counterpart. Also documents point_datasets.csv (existing) in the same place — convention is "load dataset before modeling", so the CSV story is presented in that order.
  • simulator.py — load the truth model from the CSVs csv_api.py writes (rather than hardcoded inline). The simulator then writes data.fits + point_datasets.csv + copies of the model CSVs into the dataset folder so downstream scripts pick them up. This becomes the auto-simulate chain.
  • modeling.py — refactor model composition to happen entirely from the model CSVs (main mass, main light, host halo, source light, source point, scaling tier). The inline af.Model(...) composition is shown as the alternative for users who prefer it, kept short.
  • start_here.py — mirror modeling.py from CSVs, keeping the intro-script tone (Beta-feature warning, JAX speed pitch, Google Colab setup, simpler prose).
  • Workspace audit: scan autolens_workspace for other CSV usage to confirm we're not breaking anything else. Per the prompt, the only existing case is scaling_galaxies.csv; verify this during the audit.
  • Schema decisions to lock during implementation:
    • Column convention for tuple parameters (ell_comps, etc.) — name_y / name_x vs name_0 / name_1.
    • Whether the host halo (NFWMCRLudlowSph with redshift_object / redshift_source) goes in main_lens_mass.csv or its own host_halo.csv — likely the latter since its parameter set is fundamentally different from cluster-member dPIE mass.
    • Whether scaling-tier members keep the existing 3-column y, x, luminosity schema or migrate to the new galaxy, profile_class, y, x, luminosity, ... schema. Probably keep existing — the scaling tier is implicitly one profile class per member and naming each scaling galaxy is more overhead than signal.
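
To make the sparse-column and getattr-dispatch ideas concrete, here is a minimal round-trip sketch using only the standard library. Everything in it is an assumption for illustration: ToyIsothermalSph and ToyDPIESph are stand-ins for real al.mp.* classes, toy_mp stands in for the al.mp namespace, write_family_csv / read_family_csv are hypothetical names (not the planned galaxy_models_to_csv / galaxy_models_from_csv), and the name_0 / name_1 tuple convention is just one of the two options still open.

```python
import csv
import io
from dataclasses import dataclass, fields
from types import SimpleNamespace
from typing import Tuple


# Stand-in profile classes (hypothetical; real code would use al.mp.* classes).
@dataclass
class ToyIsothermalSph:
    centre: Tuple[float, float] = (0.0, 0.0)
    einstein_radius: float = 1.0


@dataclass
class ToyDPIESph:
    centre: Tuple[float, float] = (0.0, 0.0)
    ra: float = 0.1
    rs: float = 2.0


# Stand-in for the al.mp namespace, so the getattr lookup can be shown.
toy_mp = SimpleNamespace(ToyIsothermalSph=ToyIsothermalSph, ToyDPIESph=ToyDPIESph)


def flatten(profile):
    """Map a profile to {column: value}, splitting tuples into name_0 / name_1."""
    row = {}
    for f in fields(profile):
        value = getattr(profile, f.name)
        if isinstance(value, tuple):
            for i, component in enumerate(value):
                row[f"{f.name}_{i}"] = component
        else:
            row[f.name] = value
    return row


def write_family_csv(pairs, handle):
    """Write (galaxy_name, profile) pairs as one sparse CSV."""
    rows = [
        {"galaxy": name, "profile_class": type(p).__name__, **flatten(p)}
        for name, p in pairs
    ]
    # Union of all columns in first-seen order; missing cells are left empty.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(handle, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)


def cell(row, key):
    """Return a cell as a string, treating absent and empty cells the same."""
    return row.get(key) or ""


def read_family_csv(handle, namespace):
    """Read rows back, consuming only the columns each row's class needs."""
    pairs = []
    for row in csv.DictReader(handle):
        cls = getattr(namespace, row["profile_class"])
        kwargs = {}
        for f in fields(cls):
            if cell(row, f"{f.name}_0"):
                kwargs[f.name] = (
                    float(row[f"{f.name}_0"]),
                    float(row[f"{f.name}_1"]),
                )
            elif cell(row, f.name):
                kwargs[f.name] = float(row[f.name])
            else:
                raise ValueError(f"{row['galaxy']}: missing column for {f.name}")
        pairs.append((row["galaxy"], cls(**kwargs)))
    return pairs


truth = [
    ("bcg", ToyIsothermalSph(centre=(0.1, -0.2), einstein_radius=1.6)),
    ("member_1", ToyDPIESph(centre=(3.0, 4.5), ra=0.05, rs=1.5)),
]
buffer = io.StringIO()
write_family_csv(truth, buffer)
buffer.seek(0)
loaded = read_family_csv(buffer, toy_mp)
print(buffer.getvalue())
print("round trip matches:", loaded == truth)
```

Each row leaves the other class's parameter columns empty, and the reader raises if a column its own class needs is absent, which is the behaviour the missing-column rejection test would exercise.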
Detailed implementation plan

Affected Repositories

  • PyAutoGalaxy (library — helpers + tests)
  • autolens_workspace (workspace — new guide + refactor of 3 cluster scripts)

Work Classification

Both — library ships first (PyAutoGalaxy PR), workspace follows after the library lands.

Branch Survey

Repository          Current Branch   Dirty?
PyAutoGalaxy        main             clean
autolens_workspace  main             clean (dataset/* artifacts ignored)

Suggested branch: feature/cluster-csv-api
Worktree root: ~/Code/PyAutoLabs-wt/cluster-csv-api/

Implementation Steps

  1. PyAutoGalaxy: library helpers. In autogalaxy/galaxy/galaxy_table.py (or split into a new galaxy_csv.py sibling), add:
    • galaxy_models_to_csv(galaxy_profile_pairs, file_path, profile_family) — writes one CSV per family. Each row carries galaxy, profile_class, <param columns…>, redshift?.
    • galaxy_models_from_csv(file_path) — returns a typed GalaxyModelTable carrying the rows grouped by galaxy name with profile_class dispatched to the right al.mp / al.lp / al.ps class.
    • Helper for stitching multiple family CSVs into an af.Collection (or list of af.Model(Galaxy)) keyed by the shared galaxy column.
    • Sparse-column tolerance: rows in the same CSV can use different profile classes with non-overlapping parameter columns; the reader only consumes the columns the row's class needs.
  2. PyAutoGalaxy: tests. Round-trip unit tests in test_autogalaxy/galaxy/test_galaxy_table.py (or test_galaxy_csv.py) covering: single-profile-class round-trip, multi-class sparse-column round-trip, missing-column rejection, name-join across families.
  3. autolens_workspace: scripts/cluster/csv_api.py — new guide demonstrating the round-trip with prose for every section.
  4. autolens_workspace: simulator/modeling/start_here refactor — load model CSVs via the new library helpers; simulator emits both the model CSVs and the derived point_datasets.csv into the dataset folder.
  5. Auto-sim guard — modeling/start_here check both data.fits and the model CSVs; trigger simulator if missing.
  6. Workspace audit: run grep -rn '\.csv' autolens_workspace/scripts/ to confirm we haven't missed another CSV consumer.
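
The auto-sim guard in step 5 could look roughly like the sketch below. It is a sketch under stated assumptions, not the workspace's existing auto-simulate code: the function names are hypothetical, and the required-file list mixes data.fits and point_datasets.csv from the plan with main_lens_mass.csv, which is only a candidate model CSV name.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical file list: data.fits and point_datasets.csv come from the plan,
# main_lens_mass.csv is one of the candidate model CSV names.
REQUIRED_FILES = ("data.fits", "point_datasets.csv", "main_lens_mass.csv")


def missing_dataset_files(dataset_path, required=REQUIRED_FILES):
    """Return the required file names that are absent from the dataset folder."""
    dataset_path = Path(dataset_path)
    return [name for name in required if not (dataset_path / name).is_file()]


def run_simulator(simulator_script):
    """Run the simulator script in a subprocess with the current interpreter."""
    subprocess.run([sys.executable, str(simulator_script)], check=True)


def ensure_dataset(dataset_path, simulator_script, runner=run_simulator):
    """Trigger the simulator only if something is missing; return what was missing."""
    missing = missing_dataset_files(dataset_path)
    if missing:
        runner(simulator_script)
    return missing


# Demo with a stub runner, so no real simulator is needed to try the guard.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "data.fits").touch()  # data exists, model CSVs do not
    triggered = []
    missing = ensure_dataset(folder, "simulator.py", runner=triggered.append)
    print("missing:", missing, "simulator calls:", triggered)
```

Checking the model CSVs as well as data.fits matters here because a dataset folder simulated before this change would have the FITS file but none of the CSVs the refactored scripts load.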

Key Files

  • PyAutoGalaxy/autogalaxy/galaxy/galaxy_table.py (or new sibling) — library helpers.
  • PyAutoGalaxy/test_autogalaxy/galaxy/test_galaxy_table.py — round-trip tests.
  • PyAutoLens/autolens/__init__.py — re-export new helpers under al.*.
  • autolens_workspace/scripts/cluster/csv_api.py — new guide.
  • autolens_workspace/scripts/cluster/simulator.py — load truth from CSVs.
  • autolens_workspace/scripts/cluster/modeling.py — compose model from CSVs.
  • autolens_workspace/scripts/cluster/start_here.py — compose model from CSVs.

Open Design Questions (to resolve during implementation)

  • Tuple parameter column convention (name_y / name_x vs name_0 / name_1).
  • Whether NFWMCRLudlowSph host halo gets its own CSV.
  • Whether scaling-tier CSV format migrates or stays.
  • How to encode priors vs truth values in CSV cells when the simulator and modeling scripts share the same family CSV (the simulator wants concrete floats; modeling wants UniformPrior / fixed values — likely the CSV carries only concrete values, and modeling promotes selected columns to priors at compose time).
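
For the last question, one possible shape of "the CSV carries only concrete values, and modeling promotes selected columns to priors at compose time" is sketched below. All names are hypothetical: ToyUniformPrior is a plain stand-in dataclass rather than the real prior class, and promote_to_priors with its half_width argument is an assumed interface, not a decided one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyUniformPrior:
    """Stand-in for a uniform prior (hypothetical; not the real prior class)."""

    lower_limit: float
    upper_limit: float


def promote_to_priors(row, half_width):
    """Turn selected concrete CSV values into priors centred on those values.

    row        : {column: float} as loaded from a family CSV.
    half_width : {column: float} for the columns to free; all others stay fixed.
    """
    composed = {}
    for column, value in row.items():
        if column in half_width:
            width = half_width[column]
            composed[column] = ToyUniformPrior(value - width, value + width)
        else:
            composed[column] = value  # kept as a fixed, concrete value
    return composed


# The simulator would use `row` as-is; modeling frees only einstein_radius.
row = {"centre_0": 0.25, "centre_1": -0.5, "einstein_radius": 1.5}
composed = promote_to_priors(row, half_width={"einstein_radius": 0.5})
print(composed)
```

With this split the simulator and the modeling scripts can share one family CSV, and the choice of which columns are free lives in the modeling script rather than in the CSV cells.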

Original Prompt

For autolens_workspace/scripts/cluster/simulator.py, can we make it so that all parameters are in .csv form
and loaded from there, to establish that the base way to interact with the autolens API for clusters
is via csv.

I think the way to make this work is to write a guide, autolens_workspace/scripts/cluster/csv_api.py,
which illustrates how to set up lens models using the normal autolens API and then output them to csv,
and shows how all those features link together.

This csv file can then act as an "Auto Simulate" type situation for simulator.py, which loads the csv outputs of this file.
The simulator will put the csv files in the lens it simulates at the end, meaning other scripts only need the
auto simulate performed here.

Things I am still unclear on that this guide could help with are:

Should main galaxies, extra galaxies and scaling galaxies use their own csv files or can they all be combined into one?
I think it would be good if they could all be combined into one, but this would mean the .csv needs to know a lot more
than just parameters: feasibly the mass profile class, lens name (e.g. when it's used to name light and mass profiles in the Galaxy), redshift and
others. I think I like the idea of a single .csv API being used for all cluster interfaces.

The flip side is this could get complicated because if the same galaxy has light and mass profiles then the notion of column
heads breaks down, so maybe the rule is "one csv file per light or mass profile", and the reuse of light profile names, mass profile
names and galaxy names is exploited when building the model? Most cluster models will apply the same thing over loads
of galaxies so I think that works, so we can just build it in an extensible way.

This would mean it also needs the galaxy names. Even though galaxies are not named in simulator.py when used in a Tracer,
these names would be used for performing model composition; again the csv_api.py script could explain and cover this.

I would then go so far as to make it so that this guide also explains point_datasets.csv, which functionally looks a lot
more complete to me and just needs an explanation. I would explain this before doing the galaxy API, as convention is normally to load
the dataset before modeling. Note that point_datasets.csv itself is made by simulator.py, thus I think csv_api.py
can just make an example one which is not paired to the model in the guide, making it clear it's for illustrative purposes
but that simulator.py makes the actual one.

There are also no .csv's used at all for defining the point source model, which obviously needs to be paired
with the point dataset.

Source-side composition (the SersicCore light + Point model loops in both simulator.py and the JAX registration mirror) should also be migrated to two CSV files (source_point_models.csv and source_light_models.csv), making the whole cluster experience a first-class csv experience.

I guess at this point users should easily not just load .csv's into models but also be able to print the csv
contents in python / Notebook cells and have clear print statements of the loaded objects showing how their
csv load parameters link to autolens objects, I guess the csv_api does that.

Finally, do a quick scan through autolens_workspace for other csv uses, but I think at the moment it's just scaling galaxies,
which are already implemented well.

Do deep research when coming to this csv API and once you're happy with it make the csv interface the version used on all 3 cluster
scripts that exist. This is a huge issue -- the csv interface defines cluster modeling throughout, so don't be afraid
to ask hard questions about balancing the need for modeling large amounts of galaxies against making a user friendly API.
Feel free to ask if some of this work would benefit from going into the source code more so than it already has.
