Add SFT validation eval with val_data by philippnormann · Pull Request #1850 · PrimeIntellect-ai/prime-rl

philippnormann · 2026-02-22T21:58:45Z

Summary

Add optional val_data and eval config blocks to SFT.
Run periodic validation inside SFT training and log val/loss and val/num_batches.
Add config validation that requires eval and val_data to be set together.
Add unit tests for config validation behavior.
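
The pairing rule described in this summary can be sketched with plain dataclasses. This is a minimal illustration, not the PR's code: `DataSketch`, `EvalSketch`, and `SFTConfigSketch` are hypothetical stand-ins, and the real config classes and their validation mechanism are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSketch:
    # Hypothetical stand-in for a dataset config block ([data] / [val_data]).
    name: str
    splits: list[str]


@dataclass
class EvalSketch:
    # Hypothetical stand-in for the [eval] block; interval is in steps.
    interval: int
    num_batches: Optional[int] = None


@dataclass
class SFTConfigSketch:
    data: DataSketch
    val_data: Optional[DataSketch] = None
    eval: Optional[EvalSketch] = None

    def __post_init__(self) -> None:
        # eval and val_data must be both set or both unset.
        if (self.eval is None) != (self.val_data is None):
            raise ValueError("eval and val_data must be set together")
```

Under these assumptions, a config with neither block or with both blocks constructs normally, while a config with only one of them raises `ValueError`.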

Why

Train loss alone is not enough for checkpoint selection and overfitting detection.

Before

No native periodic validation signal in SFT runs.

After

SFT can emit validation metrics at configurable intervals during training.

Evidence

Reverse-text run showing periodic validation logging behavior.

[Plots: train/loss, val/loss]

Config used:

sft_fullft_rtext_split_200.toml

max_steps = 200

[ckpt]
interval = 20

[model]
name = "PrimeIntellect/Qwen3-0.6B"

[data]
name = "willcb/R1-reverse-wikipedia-paragraphs-v1-1000"
splits = ["train[:90%]"]
seq_len = 4096
batch_size = 32
shuffle = true
seed = 42

[val_data]
name = "willcb/R1-reverse-wikipedia-paragraphs-v1-1000"
splits = ["train[90%:]"]
seq_len = 4096
batch_size = 32
shuffle = false
seed = 42

[eval]
interval = 10
num_batches = 4

[optim]
lr = 2e-5

Validation

uv run pytest tests/unit/train/sft/test_sft_eval_config.py -q
Unit tests cover: eval without val_data (invalid), val_data without eval (invalid), and eval + val_data (valid).
200-step reverse-text run emits val/loss every 10 steps as configured.

Scope

This PR covers periodic SFT validation evaluation and config validation.

Note

Medium Risk
Touches core SFT training-loop and dataset-loading paths; while changes are straightforward, they can impact training correctness/performance and distributed metric aggregation if edge cases are missed.

Overview
Adds optional SFT validation via new SFTValConfig (sft.val) that loads a separate validation dataset and runs full-pass evaluation on a configurable interval (and optionally at step 0), logging val/loss.

Refactors SFT data loading by extracting load_sft_dataset() (expensive HF I/O) and extending setup_dataset() to accept a preloaded raw dataset plus max_epochs, enabling validation to reuse preloaded data.

Restructures the SFT training loop to centralize loss/forward-backward into helpers and to compute aggregated loss/NaN counts consistently across distributed ranks, with minor logging tweaks (e.g., only emitting max_vio when present).
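
One way to picture consistent loss and NaN aggregation across ranks is to sum per-rank totals before dividing. The sketch below simulates that with a plain list instead of a real all-reduce; `RankStats` and `aggregate` are hypothetical names, and the actual helpers in the PR may compute this differently.

```python
import math
from dataclasses import dataclass


@dataclass
class RankStats:
    # Hypothetical per-rank totals accumulated over local batches.
    loss_sum: float   # sum of per-token losses over non-NaN batches
    num_tokens: int   # number of loss tokens in those batches
    nan_batches: int  # local batches whose loss was NaN


def aggregate(stats: list[RankStats]) -> tuple[float, int]:
    """Combine per-rank totals the way a sum all-reduce would."""
    loss_sum = sum(s.loss_sum for s in stats)
    num_tokens = sum(s.num_tokens for s in stats)
    nan_batches = sum(s.nan_batches for s in stats)
    # With no valid tokens anywhere, report NaN rather than dividing by zero.
    mean_loss = loss_sum / num_tokens if num_tokens > 0 else math.nan
    return mean_loss, nan_batches
```

Summing numerators and denominators separately avoids the bias that averaging per-rank means would introduce when ranks hold different numbers of tokens.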

Written by Cursor Bugbot for commit d80bb23. This will update automatically on new commits.

samsja · 2026-03-09T22:09:29Z

+    def run_validation(step: int) -> None:
+        val_dataset = setup_dataset(
+            tokenizer, config.val.data, config.model.cp * config.model.tp, max_epochs=1, raw_dataset=val_raw_dataset
+        )
+        val_dataloader = setup_dataloader(val_dataset, config.val.data)
+
+        was_training = model.training
+        model.eval()
+        mean_loss, nan_count, _ = run_forward_loop(val_dataloader, backward=False)
+        if nan_count > 0:
+            logger.warning(f"Validation at step {step}: {nan_count} batches had NaN loss")
+        if mean_loss != mean_loss:
+            logger.warning(f"Validation at step {step} had no valid tokens")
+        else:
+            logger.success(f"Validation | Step {step} | Loss: {mean_loss:.4f}")
+        monitor.log({"val/loss": mean_loss, "step": step}, step=step)
+        if was_training:
+            model.train()
+
+    if config.val is not None and config.val.eval_on_start:
+        run_validation(progress.step)


we don't need a function here
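
The `mean_loss != mean_loss` check in the diff above relies on the IEEE 754 rule that NaN compares unequal to itself. A small sketch (the `is_nan` helper is hypothetical, written only for illustration) showing that it agrees with `math.isnan` for Python floats:

```python
import math


def is_nan(x: float) -> bool:
    # NaN is the only float value that is not equal to itself.
    return x != x


for value in (float("nan"), 0.0, 1.5, float("inf")):
    # Both tests give the same answer for every float.
    assert is_nan(value) == math.isnan(value)
```

`math.isnan` states the intent more directly, while the self-comparison form has the same behavior for floats.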

samsja · 2026-03-09T22:12:43Z

+    if config.val is not None and config.val.eval_on_start:
+        run_validation(progress.step)


Let's not do this; let's rather edit the if statement below to also include step 0 as valid if eval_on_start is set.
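
A single gating condition of the kind suggested here could look like the sketch below. `should_validate` is a hypothetical helper, not the merged code, and it assumes training starts at step 0 and that interval-based validation fires on multiples of `interval`; the PR's actual step numbering may differ.

```python
def should_validate(step: int, interval: int, eval_on_start: bool) -> bool:
    """Decide in one place whether a validation pass runs at this step."""
    if step == 0:
        # Step 0 only counts as a validation step when explicitly requested.
        return eval_on_start
    return step % interval == 0


# Steps that trigger validation for interval=10 over the first 30 steps.
with_start = [s for s in range(31) if should_validate(s, 10, True)]
without_start = [s for s in range(31) if should_validate(s, 10, False)]
```

Here `with_start` is `[0, 10, 20, 30]` and `without_start` is `[10, 20, 30]`, so the pre-training pass needs no separate call site.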

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

hallerite · 2026-03-10T02:58:02Z

@philippnormann thank you for the PR! Sorry for taking so long to get it into a mergeable state – it was trickier than we expected, but I think it's in a good state now.

philippnormann · 2026-03-10T10:39:20Z

No worries, glad it made it in! Appreciate you and @samsja taking the time to get it into shape.

Also have #1849 open for SFT LoRA support if you get a chance to look at it.

cursor Bot reviewed Feb 22, 2026

Comment thread src/prime_rl/configs/sft.py Outdated

Comment thread src/prime_rl/trainer/sft/train.py Outdated

Comment thread src/prime_rl/configs/sft.py Outdated

philippnormann added 3 commits February 26, 2026 11:47

Add SFT validation eval with val_data

3cecc7d

Fix SFT validation config safety and step alignment

03343f2

Apply CP compatibility checks to val_data, align eval scheduling with checkpoint step numbering, and document new SFT eval config fields in the changelog.

Add optional eval-on-start for SFT validation

fbd6f90

Add SFTEvalConfig.eval_on_start to support an explicit pre-training validation pass while keeping interval-based eval semantics unchanged by default.

philippnormann force-pushed the feature/sft-val-eval branch from f2ba8af to fbd6f90 on February 26, 2026 10:48

cursor Bot reviewed Feb 26, 2026

Comment thread src/prime_rl/trainer/sft/train.py Outdated

philippnormann added 2 commits February 26, 2026 11:58

Fix SFT validation LoRA import and changelog entry

f26ccbb

Fix SFT eval test imports after config move

3c19f26

philippnormann mentioned this pull request Feb 27, 2026

Add SFT LoRA support #1849

Merged

refactor SFT validation config into single SFTValConfig

3741a1f

cursor Bot reviewed Mar 6, 2026

Comment thread src/prime_rl/trainer/sft/train.py Outdated

samsja reviewed Mar 6, 2026

Comment thread tests/unit/train/sft/test_sft_eval_config.py Outdated

remove tests

a34ca73

samsja reviewed Mar 6, 2026

Comment thread src/prime_rl/configs/sft.py Outdated

cursor Bot reviewed Mar 6, 2026

Comment thread src/prime_rl/configs/sft.py Outdated

Comment thread src/prime_rl/trainer/sft/train.py Outdated

hallerite and others added 3 commits March 6, 2026 04:46

factorize

bb6d3ea

remove eval_on_start

aee17b5

require sft val data

4ecfd60

cursor Bot reviewed Mar 6, 2026

Comment thread src/prime_rl/trainer/sft/train.py Outdated

hallerite added 4 commits March 6, 2026 19:38

handle ac

9a3384d

refactor

bf4b6dd

add option for val at start

0bcdb39

add back some stuff

b5cdc93

philippnormann requested a review from samsja March 8, 2026 15:19

hallerite added 2 commits March 9, 2026 22:01

Merge remote-tracking branch 'origin/main' into feature/sft-val-eval

c8a008c

Merge remote-tracking branch 'origin/main' into feature/sft-val-eval

a2f88f7

samsja reviewed Mar 9, 2026

smarter gating

e49fd18

cursor Bot reviewed Mar 9, 2026

Comment thread src/prime_rl/configs/sft.py

samsja approved these changes Mar 9, 2026

hallerite added 3 commits March 9, 2026 22:45

smol fix

47f3b30

fix no-recompile

67b828a

make sure forward is compiled

4d3d4b8

cursor Bot reviewed Mar 9, 2026

Comment thread src/prime_rl/trainer/sft/train.py

hallerite added 2 commits March 9, 2026 23:13

fix timing metric

81fefe7

Merge branch 'main' into feature/sft-val-eval

d80bb23

samsja approved these changes Mar 10, 2026

hallerite merged commit 447cafe into PrimeIntellect-ai:main Mar 10, 2026
14 of 16 checks passed

philippnormann deleted the feature/sft-val-eval branch March 10, 2026 10:39

hallerite merged 22 commits into PrimeIntellect-ai:main from philippnormann:feature/sft-val-eval
