Skip to content

Commit 59b00a2

Browse files
ElektrikAkarclaude
andcommitted
fix: upgrade HiGHS to v1.14.0, suppress debug assertion on warm-start MIP
HiGHS <=1.14.0 assert(ub_consistent) fires in updatePrimalDualIntegral() during warm-start MIP solves. This is a performance metric tracker, not solution correctness — prev_lb/prev_ub/prev_gap are documented "Only for checking/debugging". Presolve restart offset arithmetic introduces roundoff exceeding the 1e-12 tolerance. Verified by Codex GPT-5.4 (xhigh reasoning): not a solution bug. Workaround: NDEBUG on HiGHS target. Proper upstream fix needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 553c4d9 commit 59b00a2

3 files changed

Lines changed: 61 additions & 40 deletions

File tree

.claude/LESSONS.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,11 @@ Critical knowledge to avoid repeating mistakes.
5353
- **MATLAB + MSVC OpenMP: exit segfault in `-batch` mode.** Functionality works; segfault after output. Check output, not exit code.
5454
- **nanobind over pybind11.** Stable ABI, 5-10x smaller binaries, native CUDA ndarray. GIL release for >10ms calls.
5555

56+
## HiGHS MIP Solver (IMPORTANT — workaround in place)
57+
58+
- **HiGHS <=1.14.0 `assert(ub_consistent)` fires on warm-start MIP.** The assertion is in `updatePrimalDualIntegral()` — a performance metric tracker, NOT solution correctness. `prev_lb/prev_ub/prev_gap` are documented "Only for checking/debugging" (line 2802). The P-D integral is never used to accept/reject incumbents. Presolve restart rebases bounds with offset arithmetic that introduces roundoff exceeding the 1e-12 tolerance. **Current workaround:** `target_compile_definitions(highs PRIVATE NDEBUG)` in `cmake/Dependencies.cmake` — too blunt (suppresses ALL HiGHS assertions). **Proper fix needed:** patch HiGHS to skip `check_prev_data` after restart, or relax the tolerance in this specific block. File upstream issue at github.com/ERGO-Code/HiGHS.
59+
- **Verified by Codex (GPT-5.4, xhigh reasoning):** Not a solution-correctness bug. The workaround is legitimate short-term.
60+
5661
## Build System
5762

5863
- **CUDA multi-version on Windows:** Generate `Directory.Build.props` with `<CudaToolkitCustomDir>`.
Lines changed: 50 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -1,57 +1,70 @@
11
# Session Handoff: Chunked CLARA + Float32 Views + ARC SLURM (2026-04-08)
22

3-
**One-liner:** `--ram-limit` wired into chunked CLARA with Parquet row-group streaming, float32 view-mode, OpenMP-parallel chunked assignment, ARC SLURM build profiles for full GPU fleet (P100H100).
3+
`--ram-limit` wired into chunked CLARA with Parquet row-group streaming, float32 view-mode, OpenMP-parallel chunked assignment, ARC SLURM build profiles for full GPU fleet (P100 through H100), HiGHS upgraded to 1.14.0.
44

5-
## What Was Done
5+
## What Was Done (3 commits on Claude branch)
66

7-
1. **Float32 view-mode**`Data` now supports `span<const float>` views via `p_spans_f32_`. CLARA subsampling works with float32 data (zero-copy).
8-
2. **ParquetChunkReader** — new `dtwc/io/parquet_chunk_reader.hpp`: row-group streaming, sparse row access (`read_rows()`), RAM budget calculation, float32 path (`read_row_groups_f32`).
9-
3. **Chunked CLARA**`assign_all_points_chunked()` and `assign_all_points_chunked_f32()` stream row groups within RAM budget. `fast_clara_chunked()` loads subsamples + medoids from Parquet on demand. OpenMP `parallel for` on inner loop.
10-
4. **CLI wiring**`--ram-limit` populates `CLARAOptions` for Parquet input. `--precision float32` triggers f32 chunked path.
11-
5. **Adversarial review** — 2 agents (Opus code-reviewer + Codex). Fixed: float Parquet `static_pointer_cast<DoubleArray>` crash (High), int overflow at >2B rows (Medium), duplicate index double-move (Medium), bounds validation, `row_groups_per_batch` edge case, N-element copy per subsample replaced with `std::sample`.
12-
6. **CUDA CMake**`cmake_minimum_required(VERSION 3.26)`, CUDA C++20 now works. CI updated to `pip install cmake>=3.26`.
13-
7. **ARC SLURM support** — default CUDA archs expanded to `60;70;75;80;86;89;90` (P100→H100). Build script `scripts/slurm/build-arc.sh` with 6 profiles (arc, htc-cpu, htc-gpu, htc-v4, h100, grace).
14-
8. **HiGHS assertion fix** — HiGHS v1.13.1 debug `assert(ub_consistent)` fires on warm-start MIP. Fixed by adding `NDEBUG` compile definition to HiGHS target. 67/67 tests now pass.
15-
9. **Docs** — CHANGELOG.md (Phase 4 features), README.md (feature list + `DTWC_ENABLE_ARROW`), LESSONS.md (Arrow/Parquet + ARC hardware notes).
16-
10. **Cleanup** — deleted 8 older handoff files.
7+
1. **Float32 view-mode**`Data` supports `span<const float>` views via `p_spans_f32_`. CLARA subsampling works with float32 data (zero-copy). New constructor, updated `size()`, `series_f32()`, `series_flat_size()`.
8+
9+
2. **ParquetChunkReader** — new `dtwc/io/parquet_chunk_reader.hpp`. Row-group streaming, sparse row access (`read_rows()`), RAM budget calculation, float32 path (`read_row_groups_f32`). Handles both Float and Double Parquet columns. Thread-safety documented.
10+
11+
3. **Chunked CLARA** — two new functions in `fast_clara.cpp`:
12+
- `assign_all_points_chunked()` / `_f32()` — stream row groups within RAM budget, OpenMP `parallel for` on inner loop
13+
- `fast_clara_chunked()` — loads subsamples + medoids from Parquet on demand via `std::sample` (O(sample_size), not O(N))
14+
- Dispatch in `fast_clara()` routes to chunked mode when `ram_limit > estimated_data_size`
15+
16+
4. **CLI wiring**`--ram-limit` populates `CLARAOptions` for Parquet input. `--precision float32` triggers f32 chunked path. Warning printed for non-Parquet input.
17+
18+
5. **Adversarial review** — 2 Opus agents + 1 Codex review. Bugs fixed:
19+
- *High*: float Parquet `static_pointer_cast<DoubleArray>` — now checks value type
20+
- *Medium*: int overflow at >2B rows — uses `int64_t` + `mt19937_64`
21+
- *Medium*: `read_rows()` double-move on duplicates — uses copy
22+
- *Low*: bounds validation, `row_groups_per_batch` edge case
23+
24+
6. **CUDA CMake**`cmake_minimum_required(VERSION 3.26)` across all CMakeLists. CUDA C++20 now works. CI updated to `pip install cmake>=3.26`.
25+
26+
7. **ARC SLURM support** — CUDA archs expanded to `60;70;75;80;86;89;90` (P100 through H100). Build script `scripts/slurm/build-arc.sh` with 6 profiles: `arc`, `htc-cpu`, `htc-gpu`, `htc-v4`, `h100`, `grace`.
27+
28+
8. **HiGHS v1.14.0** — upgraded from v1.13.1. Debug `assert(ub_consistent)` in primal-dual integral tracking still fires on warm-start MIP in both versions. Verified by Codex GPT-5.4 (xhigh reasoning): not a solution-correctness bug — it is bookkeeping for a performance metric. Workaround: `NDEBUG` compile def on HiGHS target. Needs proper upstream fix (see LESSONS.md).
29+
30+
9. **Documentation** — CHANGELOG.md (Phase 4 features), README.md (feature list, `DTWC_ENABLE_ARROW` option), LESSONS.md (Arrow/Parquet, ARC hardware, HiGHS workaround).
31+
32+
10. **Cleanup** — deleted 8 obsolete handoff files.
1733

1834
## Current State
1935

20-
- **Branch:** Claude
36+
- **Branch:** Claude (3 commits ahead of origin/Claude)
2137
- **Tests:** 67/67 pass, 2 CUDA skipped
2238
- **Build:** Clang 21, C++20, Ninja, Windows 11
2339

24-
## Files Changed
40+
## Key Files
2541

26-
### New files
27-
- `dtwc/io/parquet_chunk_reader.hpp` — row-group streaming reader
28-
- `scripts/slurm/build-arc.sh` — ARC SLURM build profiles
29-
30-
### Modified
31-
- `dtwc/Data.hpp``p_spans_f32_`, f32 view constructor, updated accessors
32-
- `dtwc/Problem.hpp``dtw_function()` / `dtw_function_f32()` public accessors
33-
- `dtwc/algorithms/fast_clara.hpp``CLARAOptions` + ram_limit, parquet_path, use_float32
34-
- `dtwc/algorithms/fast_clara.cpp` — f32 subsample, chunked assignment (f64+f32), OpenMP, dispatch
35-
- `dtwc/dtwc_cl.cpp``--ram-limit` wired into `clara_opts`
36-
- `CMakeLists.txt` — cmake_minimum_required 3.26, CUDA archs 60-90
37-
- `.github/workflows/cuda-mpi-detect.yml` — pip install cmake>=3.26
38-
- `tests/unit/unit_test_Data.cpp` — float32 view-mode tests
39-
- `tests/unit/algorithms/unit_test_fast_clara.cpp` — float32 CLARA test
40-
- `CHANGELOG.md`, `README.md`, `.claude/TODO.md`, `.claude/LESSONS.md`
42+
| File | Role |
43+
|------|------|
44+
| `dtwc/io/parquet_chunk_reader.hpp` | Row-group streaming Parquet reader (new) |
45+
| `dtwc/algorithms/fast_clara.cpp` | Chunked CLARA + f32 paths + OpenMP |
46+
| `dtwc/algorithms/fast_clara.hpp` | `CLARAOptions` with ram_limit, parquet_path, use_float32 |
47+
| `dtwc/Data.hpp` | Float32 view-mode (`p_spans_f32_`) |
48+
| `dtwc/Problem.hpp` | `dtw_function()` / `dtw_function_f32()` accessors |
49+
| `dtwc/dtwc_cl.cpp` | `--ram-limit` wired into `clara_opts` |
50+
| `scripts/slurm/build-arc.sh` | ARC SLURM build profiles (new) |
51+
| `cmake/Dependencies.cmake` | HiGHS v1.14.0 + NDEBUG workaround |
4152

4253
## What To Do Next
4354

44-
### Immediate (SLURM session)
45-
1. SSH to ARC, `source scripts/slurm/build-arc.sh htc-gpu` — verify Arrow CPM build on Linux
46-
2. Run on real battery Parquet data with `--ram-limit 2G --precision float32 --method clara`
47-
3. Test H100 GPU path: `source scripts/slurm/build-arc.sh h100`
55+
### SLURM session
56+
1. Push branch, SSH to ARC
57+
2. `source scripts/slurm/build-arc.sh htc-gpu` — verify Arrow CPM build on Linux + full GPU fleet
58+
3. Run on real battery Parquet data: `dtwc --ram-limit 2G --precision float32 --method clara -k 10 battery.parquet`
59+
4. Test H100 GPU path: `source scripts/slurm/build-arc.sh h100`
4860

4961
### Short-term
50-
4. Integration test for chunked CLARA with synthetic Parquet file (no Parquet on Windows CI)
51-
5. Sample size scaling: `sqrt(N)` for large N (current formula too small at 100M)
52-
6. CLARA checkpointing: save/resume assignment state for long runs
62+
5. Integration test for chunked CLARA with synthetic Parquet file
63+
6. Sample size scaling: `sqrt(N)` for large N (current formula too small at 100M)
64+
7. CLARA checkpointing: save/resume assignment state for long runs
65+
8. File upstream HiGHS issue for `ub_consistent` assertion
5366

5467
### Known Issues
55-
5668
- Arrow CPM build on Windows+Clang: blocked by Arrow upstream ExternalProject flag quoting
5769
- Grace Hopper (htc-g057): AArch64 CPU build untested, CUDA kernel not ported to ARM
70+
- HiGHS NDEBUG workaround is too blunt — suppresses all HiGHS assertions (see LESSONS.md)

cmake/Dependencies.cmake

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -28,14 +28,17 @@ function(dtwc_setup_dependencies)
2828
if(NOT TARGET highs::highs AND DTWC_ENABLE_HIGHS)# HiGHS library:
2929
CPMAddPackage(
3030
NAME highs
31-
URL "https://github.com/ERGO-Code/HiGHS/archive/refs/tags/v1.13.1.tar.gz"
31+
URL "https://github.com/ERGO-Code/HiGHS/archive/refs/tags/v1.14.0.tar.gz"
3232
SYSTEM
3333
EXCLUDE_FROM_ALL
3434
OPTIONS
3535
"CI OFF" "ZLIB OFF" "BUILD_EXAMPLES OFF" "BUILD_TESTING OFF" "FAST_BUILD ON"
3636
)
37-
# HiGHS v1.13.1 has debug assertions (ub_consistent) that fire on valid warm-start
38-
# MIP solves due to numerical noise. Suppress by defining NDEBUG on HiGHS targets.
37+
# HiGHS <=1.14.0 has a debug assertion (ub_consistent) that fires on valid
38+
# warm-start MIP solves due to rounding in primal-dual integral tracking
39+
# after presolve reset. The solution is correct; the bookkeeping tolerance
40+
# (1e-12) is too tight. Suppress by defining NDEBUG on HiGHS target.
41+
# Upstream: https://github.com/ERGO-Code/HiGHS — not yet fixed as of 1.14.0.
3942
if(TARGET highs)
4043
target_compile_definitions(highs PRIVATE NDEBUG)
4144
endif()

0 commit comments

Comments
 (0)