
feat: Add progress logging during eval_set resume scanning#3339

Open
QuantumLove wants to merge 1 commit into UKGovernmentBEIS:main from METR:feat/eval-set-resume-progress-logging
Conversation

@QuantumLove

Summary

When eval_set() resumes and scans S3 for existing logs, this can take 0-90 minutes for large eval sets. Currently there's no visibility into this process.

This PR adds progress logging by wiring the existing ReadEvalLogsProgress infrastructure, which was not previously used by eval_set(), into the resume scan.

Changes

  • Added EvalSetScanProgress class that implements ReadEvalLogsProgress
  • Wired progress logging into the try_eval() function in eval_set()
  • Logs at reasonable intervals (every 100 files or 10%) to avoid spam
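The throttling described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the real ReadEvalLogsProgress interface in Inspect AI may differ, and the `update()` method name and return value are assumptions made here.

```python
import logging

logger = logging.getLogger(__name__)


class EvalSetScanProgress:
    """Illustrative progress reporter for the resume scan.

    Emits a progress line every `interval` files, where the interval is
    100 files or 10% of the total (whichever is larger), plus a final
    line when the last file is read.
    """

    def __init__(self, total: int, log_dir: str) -> None:
        self.total = total
        self.read = 0
        self.interval = max(100, total // 10)
        logger.info(
            "Found %s eval log files in %s, reading headers...",
            f"{total:,}", log_dir,
        )

    def update(self, n: int = 1) -> bool:
        """Record n files read; return True if a progress line was emitted."""
        self.read += n
        emit = self.read % self.interval == 0 or self.read >= self.total
        if emit:
            logger.info("Reading eval logs: %d/%d", self.read, self.total)
        return emit
```

For the 250-file test below, the interval works out to max(100, 25) = 100, so progress lines appear at 100, 200, and 250.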

Log Output Example

INFO: Found 1,234 eval log files in s3://bucket/evals/abc123, reading headers...
INFO: Reading eval logs: 100/1234
INFO: Reading eval logs: 200/1234
...
INFO: Resume scan complete: 1,234 logs read in 45.3s (1,100 completed, 134 pending)

Testing

  • Unit tests for EvalSetScanProgress class (7 tests)
  • All existing test_eval_set.py tests pass
  • Type checking passes (pyright)

🤖 Generated with Claude Code

@QuantumLove
Author

Testing Summary

  How I Tested It

  1. Created 250 valid .json eval log files using Inspect AI's own write_eval_log() API to ensure they're in the correct format that list_eval_logs() can parse.
  2. Ran eval_set() with:
    - log_dir pointing to the directory with existing logs
    - log_dir_allow_dirty=True to allow unrelated logs
    - log_level="info" to show INFO-level messages
    - display="none" to see raw logs instead of Rich UI
  3. The eval_set resume flow then:
    - Scanned the directory for existing .json log files
    - Read headers from all 250 files (triggering progress logging)
    - Determined 0 completed tasks, 1 pending task
    - Ran the new task

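The resume-scan flow in the steps above can be simulated with a small self-contained sketch. This is not the PR's code: the `scan_headers` helper and the `status` field are assumptions for illustration, and "pending" here counts incomplete logs, whereas eval_set() itself reports pending tasks.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="INFO     %(message)s")
logger = logging.getLogger("evalset")


def scan_headers(headers: list[dict]) -> str:
    """Read pre-parsed log headers, logging throttled progress, and
    return a summary line in the same shape as the test output."""
    start = time.monotonic()
    total = len(headers)
    # Every 100 files or 10% of the total, whichever is larger.
    interval = max(100, total // 10)
    for i, _header in enumerate(headers, start=1):
        if i % interval == 0 or i == total:
            logger.info("Reading eval logs: %d/%d", i, total)
    completed = sum(1 for h in headers if h.get("status") == "success")
    pending = total - completed
    elapsed = time.monotonic() - start
    return (
        f"Resume scan complete: {total:,} logs read in "
        f"{elapsed:.1f}s ({completed:,} completed, {pending:,} pending)"
    )
```

Driving it with 250 incomplete headers produces progress lines at 100, 200, and 250, followed by the summary string.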
  Successful Test Output

  ======================================================================
  Running eval_set() - watch for progress logging below:
  (Look for 'Found X eval log files' and 'Reading eval logs: X/Y')
  ======================================================================

  [02/26/26 13:18:00] INFO     Found 250 eval log files in           evalset.py:96
                               /var/folders/x2/jkcxl7g52ld48p71lkmpc
                               zy00000gn/T/eval_set_progress_test_k7
                               m75fvx, reading headers...
                      INFO     Reading eval logs: 100/250           evalset.py:110
                      INFO     Reading eval logs: 200/250           evalset.py:110
                      INFO     Reading eval logs: 250/250           evalset.py:110
                      INFO     Resume scan complete: 250 logs read  evalset.py:117
                               in 0.1s (0 completed, 1 pending)

  eval_set completed: success=True, logs=1

  What Each Log Line Shows

  ┌────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────┐
  │                                  Log Message                                   │                                Purpose                                 │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Found 250 eval log files in .../eval_set_progress_test_..., reading headers... │ Initial scan discovered 250 files                                      │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 100/250                                                     │ Progress at 100 files (40%)                                            │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 200/250                                                     │ Progress at 200 files (80%)                                            │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 250/250                                                     │ Final progress at 250 files (100%)                                     │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Resume scan complete: 250 logs read in 0.1s (0 completed, 1 pending)           │ Summary: 0.1s scan time, 0 matching completed tasks, 1 new task to run │
  └────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

@QuantumLove QuantumLove force-pushed the feat/eval-set-resume-progress-logging branch 2 times, most recently from f9f7349 to 5213d03 Compare February 26, 2026 13:06
When eval_set() resumes and scans S3 for existing logs, this can take
0-90 minutes for large eval sets with no visibility. This adds progress
logging using the existing ReadEvalLogsProgress infrastructure.

Changes:
- Add EvalSetScanProgress class implementing ReadEvalLogsProgress
- Wire progress logging into try_eval() function
- Log at reasonable intervals (every 100 files or 10%, whichever is larger)
- Add @override decorators for type safety (consistent with codebase)
- Add comprehensive unit tests

Log output example:
  INFO: Found 1,234 eval log files in s3://..., reading headers...
  INFO: Reading eval logs: 100/1234
  ...
  INFO: Resume scan complete: 1,234 logs read in 45.3s (1,100 completed, 134 pending)

Co-Authored-By: Claude <[email protected]>
@QuantumLove QuantumLove force-pushed the feat/eval-set-resume-progress-logging branch from 5213d03 to c91877f Compare February 26, 2026 13:11
@QuantumLove QuantumLove marked this pull request as ready for review February 26, 2026 13:12
@jjallaire
Collaborator

We just merged a PR (#3300) that should massively improve the performance here, possibly to the point where we don't need progress. @ransomr what do you think?

@ransomr
Contributor

ransomr commented Feb 26, 2026

There are two things that are potentially slow.

  1. Scanning for the logs. "Async parallel reading of eval log headers" (#3189) sped this up dramatically.
  2. Reading samples from incomplete logs. "perf: share AsyncFilesystem via ContextVar across async context" (#3300) sped this up.

The progress here is only for step 1 - @QuantumLove are you still seeing that part be slow enough that log progress is needed? In my testing it was down to ~12s for a 600 log data set. The output above shows 45s for 1200 log files. If you are still seeing this take many minutes I'd be interested in understanding the scenario more.

I would expect #2 to still be the slower part of the process, though reusing partial logs may not happen often enough for it to matter.

If the evalset setup is still slow enough that we generally need progress, I think we should probably update the evalset textual UI to show it.

@QuantumLove
Author

Hey @jjallaire and @ransomr, thank you for the quick reply, and happy to hear about the performance enhancements.

We will upgrade to the latest version and report back what we see. Stay tuned!

@jjallaire
Collaborator

Note that eval set actually has a phase where there is no textual UI (which only occurs during actual task execution) so it would be completely fine to do this progress at the terminal if we do in fact need it.
