|
1 | 1 | # Session Handoff: Chunked CLARA + Float32 Views + ARC SLURM (2026-04-08) |
2 | 2 |
|
3 | | -**One-liner:** `--ram-limit` wired into chunked CLARA with Parquet row-group streaming, float32 view-mode, OpenMP-parallel chunked assignment, ARC SLURM build profiles for full GPU fleet (P100→H100). |
| 3 | +`--ram-limit` wired into chunked CLARA with Parquet row-group streaming, float32 view-mode, OpenMP-parallel chunked assignment, ARC SLURM build profiles for full GPU fleet (P100 through H100), HiGHS upgraded to 1.14.0. |
4 | 4 |
|
5 | | -## What Was Done |
| 5 | +## What Was Done (3 commits on Claude branch) |
6 | 6 |
|
7 | | -1. **Float32 view-mode** — `Data` now supports `span<const float>` views via `p_spans_f32_`. CLARA subsampling works with float32 data (zero-copy). |
8 | | -2. **ParquetChunkReader** — new `dtwc/io/parquet_chunk_reader.hpp`: row-group streaming, sparse row access (`read_rows()`), RAM budget calculation, float32 path (`read_row_groups_f32`). |
9 | | -3. **Chunked CLARA** — `assign_all_points_chunked()` and `assign_all_points_chunked_f32()` stream row groups within RAM budget. `fast_clara_chunked()` loads subsamples + medoids from Parquet on demand. OpenMP `parallel for` on inner loop. |
10 | | -4. **CLI wiring** — `--ram-limit` populates `CLARAOptions` for Parquet input. `--precision float32` triggers f32 chunked path. |
11 | | -5. **Adversarial review** — 2 agents (Opus code-reviewer + Codex). Fixed: float Parquet `static_pointer_cast<DoubleArray>` crash (High), int overflow at >2B rows (Medium), duplicate index double-move (Medium), bounds validation, `row_groups_per_batch` edge case, N-element copy per subsample replaced with `std::sample`. |
12 | | -6. **CUDA CMake** — `cmake_minimum_required(VERSION 3.26)`, CUDA C++20 now works. CI updated to `pip install cmake>=3.26`. |
13 | | -7. **ARC SLURM support** — default CUDA archs expanded to `60;70;75;80;86;89;90` (P100→H100). Build script `scripts/slurm/build-arc.sh` with 6 profiles (arc, htc-cpu, htc-gpu, htc-v4, h100, grace). |
14 | | -8. **HiGHS assertion fix** — HiGHS v1.13.1 debug `assert(ub_consistent)` fires on warm-start MIP. Fixed by adding `NDEBUG` compile definition to HiGHS target. 67/67 tests now pass. |
15 | | -9. **Docs** — CHANGELOG.md (Phase 4 features), README.md (feature list + `DTWC_ENABLE_ARROW`), LESSONS.md (Arrow/Parquet + ARC hardware notes). |
16 | | -10. **Cleanup** — deleted 8 older handoff files. |
| 7 | +1. **Float32 view-mode** — `Data` supports `span<const float>` views via `p_spans_f32_`. CLARA subsampling works with float32 data (zero-copy). New constructor, updated `size()`, `series_f32()`, `series_flat_size()`. |
| 8 | + |
| 9 | +2. **ParquetChunkReader** — new `dtwc/io/parquet_chunk_reader.hpp`. Row-group streaming, sparse row access (`read_rows()`), RAM budget calculation, float32 path (`read_row_groups_f32`). Handles both Float and Double Parquet columns. Thread-safety documented. |
| 10 | + |
| 11 | +3. **Chunked CLARA** — two new functions in `fast_clara.cpp`: |
| 12 | + - `assign_all_points_chunked()` / `_f32()` — stream row groups within RAM budget, OpenMP `parallel for` on inner loop |
| 13 | + - `fast_clara_chunked()` — loads subsamples + medoids from Parquet on demand via `std::sample` (copies O(sample_size) elements instead of making an N-element copy per subsample)
| 14 | + - Dispatch in `fast_clara()` routes to chunked mode when `estimated_data_size > ram_limit`
| 15 | + |
| 16 | +4. **CLI wiring** — `--ram-limit` populates `CLARAOptions` for Parquet input. `--precision float32` triggers f32 chunked path. Warning printed for non-Parquet input. |
| 17 | + |
| 18 | +5. **Adversarial review** — 2 Opus agents + 1 Codex review. Bugs fixed: |
| 19 | + - *High*: float Parquet `static_pointer_cast<DoubleArray>` — now checks value type |
| 20 | + - *Medium*: int overflow at >2B rows — uses `int64_t` + `mt19937_64` |
| 21 | + - *Medium*: `read_rows()` double-move on duplicates — uses copy |
| 22 | + - *Low*: bounds validation, `row_groups_per_batch` edge case |
| 23 | + |
| 24 | +6. **CUDA CMake** — `cmake_minimum_required(VERSION 3.26)` across all CMakeLists. CUDA C++20 now works. CI updated to `pip install "cmake>=3.26"` (quoted so the shell does not treat `>` as a redirect).
| 25 | + |
| 26 | +7. **ARC SLURM support** — CUDA archs expanded to `60;70;75;80;86;89;90` (P100 through H100). Build script `scripts/slurm/build-arc.sh` with 6 profiles: `arc`, `htc-cpu`, `htc-gpu`, `htc-v4`, `h100`, `grace`. |
| 27 | + |
| 28 | +8. **HiGHS v1.14.0** — upgraded from v1.13.1. Debug `assert(ub_consistent)` in primal-dual integral tracking still fires on warm-start MIP in both versions. Verified by Codex GPT-5.4 (xhigh reasoning): not a solution-correctness bug — it is bookkeeping for a performance metric. Workaround: `NDEBUG` compile def on HiGHS target. Needs proper upstream fix (see LESSONS.md). |
| 29 | + |
| 30 | +9. **Documentation** — CHANGELOG.md (Phase 4 features), README.md (feature list, `DTWC_ENABLE_ARROW` option), LESSONS.md (Arrow/Parquet, ARC hardware, HiGHS workaround). |
| 31 | + |
| 32 | +10. **Cleanup** — deleted 8 obsolete handoff files. |
17 | 33 |
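
The RAM-budget and chunked-assignment items above can be illustrated with a minimal sketch. It is not the code in `fast_clara.cpp` or `parquet_chunk_reader.hpp`: the names `groups_per_batch`, `toy_distance`, and `assign_chunk` are hypothetical, the distance is a plain sum of absolute differences standing in for DTW, and the chunk is assumed to already be in memory rather than read from Parquet. It only shows how a byte budget could translate into a number of row groups per batch, and how the per-series loop inside one chunk could be parallelised with OpenMP without write conflicts.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: how many row groups fit into the RAM budget at once.
// All sizes are int64_t so that more than 2^31 rows cannot overflow.
inline std::int64_t groups_per_batch(std::int64_t ram_limit_bytes,
                                     std::int64_t rows_per_group,
                                     std::int64_t series_length,
                                     std::int64_t bytes_per_value) {
  const std::int64_t bytes_per_group =
      rows_per_group * series_length * bytes_per_value;
  if (bytes_per_group <= 0) return 1;  // degenerate input: one group at a time
  const std::int64_t fit = ram_limit_bytes / bytes_per_group;
  return fit < 1 ? 1 : fit;  // always make progress, even if one group exceeds the budget
}

// Stand-in distance (sum of absolute differences over the common prefix).
// The real code would call a DTW function here.
inline double toy_distance(const std::vector<float>& a,
                           const std::vector<float>& b) {
  const std::size_t n = a.size() < b.size() ? a.size() : b.size();
  double d = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    d += std::fabs(static_cast<double>(a[i]) - static_cast<double>(b[i]));
  return d;
}

// Label every series of one in-memory chunk with the index of its nearest medoid.
// chunk_offset is the global row index of the first series in the chunk.
// Each iteration writes a distinct slot of labels, so the loop needs no locking.
inline void assign_chunk(const std::vector<std::vector<float>>& chunk,
                         const std::vector<std::vector<float>>& medoids,
                         std::int64_t chunk_offset,
                         std::vector<int>& labels) {
  const std::int64_t n = static_cast<std::int64_t>(chunk.size());
  const int k = static_cast<int>(medoids.size());
#pragma omp parallel for schedule(static)
  for (std::int64_t i = 0; i < n; ++i) {
    int best = -1;
    double best_d = std::numeric_limits<double>::infinity();
    for (int m = 0; m < k; ++m) {
      const double d = toy_distance(chunk[static_cast<std::size_t>(i)],
                                    medoids[static_cast<std::size_t>(m)]);
      if (d < best_d) {
        best_d = d;
        best = m;
      }
    }
    labels[static_cast<std::size_t>(chunk_offset + i)] = best;
  }
}
```

Under these assumptions, a 2 GiB budget with 10,000 rows per group and series of length 1,000 gives 53 row groups per batch for float32 (4 bytes per value) and 26 for float64, which is the practical reason to pair `--ram-limit` with `--precision float32`. Without `-fopenmp` the pragma is ignored and the loop runs serially with the same result.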
|
18 | 34 | ## Current State |
19 | 35 |
|
20 | | -- **Branch:** Claude |
| 36 | +- **Branch:** Claude (3 commits ahead of origin/Claude) |
21 | 37 | - **Tests:** 67/67 pass, 2 CUDA skipped |
22 | 38 | - **Build:** Clang 21, C++20, Ninja, Windows 11 |
23 | 39 |
|
24 | | -## Files Changed |
| 40 | +## Key Files |
25 | 41 |
|
26 | | -### New files |
27 | | -- `dtwc/io/parquet_chunk_reader.hpp` — row-group streaming reader |
28 | | -- `scripts/slurm/build-arc.sh` — ARC SLURM build profiles |
29 | | - |
30 | | -### Modified |
31 | | -- `dtwc/Data.hpp` — `p_spans_f32_`, f32 view constructor, updated accessors |
32 | | -- `dtwc/Problem.hpp` — `dtw_function()` / `dtw_function_f32()` public accessors |
33 | | -- `dtwc/algorithms/fast_clara.hpp` — `CLARAOptions` + ram_limit, parquet_path, use_float32 |
34 | | -- `dtwc/algorithms/fast_clara.cpp` — f32 subsample, chunked assignment (f64+f32), OpenMP, dispatch |
35 | | -- `dtwc/dtwc_cl.cpp` — `--ram-limit` wired into `clara_opts` |
36 | | -- `CMakeLists.txt` — cmake_minimum_required 3.26, CUDA archs 60-90 |
37 | | -- `.github/workflows/cuda-mpi-detect.yml` — pip install cmake>=3.26 |
38 | | -- `tests/unit/unit_test_Data.cpp` — float32 view-mode tests |
39 | | -- `tests/unit/algorithms/unit_test_fast_clara.cpp` — float32 CLARA test |
40 | | -- `CHANGELOG.md`, `README.md`, `.claude/TODO.md`, `.claude/LESSONS.md` |
| 42 | +| File | Role | |
| 43 | +|------|------| |
| 44 | +| `dtwc/io/parquet_chunk_reader.hpp` | Row-group streaming Parquet reader (new) | |
| 45 | +| `dtwc/algorithms/fast_clara.cpp` | Chunked CLARA + f32 paths + OpenMP | |
| 46 | +| `dtwc/algorithms/fast_clara.hpp` | `CLARAOptions` with ram_limit, parquet_path, use_float32 | |
| 47 | +| `dtwc/Data.hpp` | Float32 view-mode (`p_spans_f32_`) | |
| 48 | +| `dtwc/Problem.hpp` | `dtw_function()` / `dtw_function_f32()` accessors | |
| 49 | +| `dtwc/dtwc_cl.cpp` | `--ram-limit` wired into `clara_opts` | |
| 50 | +| `scripts/slurm/build-arc.sh` | ARC SLURM build profiles (new) | |
| 51 | +| `cmake/Dependencies.cmake` | HiGHS v1.14.0 + NDEBUG workaround | |
41 | 52 |
|
42 | 53 | ## What To Do Next |
43 | 54 |
|
44 | | -### Immediate (SLURM session) |
45 | | -1. SSH to ARC, `source scripts/slurm/build-arc.sh htc-gpu` — verify Arrow CPM build on Linux |
46 | | -2. Run on real battery Parquet data with `--ram-limit 2G --precision float32 --method clara` |
47 | | -3. Test H100 GPU path: `source scripts/slurm/build-arc.sh h100` |
| 55 | +### SLURM session |
| 56 | +1. Push branch, SSH to ARC |
| 57 | +2. `source scripts/slurm/build-arc.sh htc-gpu` — verify Arrow CPM build on Linux + full GPU fleet |
| 58 | +3. Run on real battery Parquet data: `dtwc --ram-limit 2G --precision float32 --method clara -k 10 battery.parquet` |
| 59 | +4. Test H100 GPU path: `source scripts/slurm/build-arc.sh h100` |
48 | 60 |
|
49 | 61 | ### Short-term |
50 | | -4. Integration test for chunked CLARA with synthetic Parquet file (no Parquet on Windows CI) |
51 | | -5. Sample size scaling: `sqrt(N)` for large N (current formula too small at 100M) |
52 | | -6. CLARA checkpointing: save/resume assignment state for long runs |
| 62 | +5. Integration test for chunked CLARA with synthetic Parquet file |
| 63 | +6. Sample size scaling: `sqrt(N)` for large N (current formula too small at 100M) |
| 64 | +7. CLARA checkpointing: save/resume assignment state for long runs |
| 65 | +8. File upstream HiGHS issue for `ub_consistent` assertion |
53 | 66 |
|
54 | 67 | ### Known Issues |
55 | | - |
56 | 68 | - Arrow CPM build on Windows+Clang: blocked by Arrow upstream ExternalProject flag quoting |
57 | 69 | - Grace Hopper (htc-g057): AArch64 CPU build untested, CUDA kernel not ported to ARM |
| 70 | +- HiGHS NDEBUG workaround is too blunt — suppresses all HiGHS assertions (see LESSONS.md) |