
feat: Add progress logging during eval_set resume scanning#3339

Open
QuantumLove wants to merge 1 commit into UKGovernmentBEIS:main from METR:feat/eval-set-resume-progress-logging
Conversation

@QuantumLove

Summary

When eval_set() resumes and scans S3 for existing logs, this can take 0-90 minutes for large eval sets. Currently there's no visibility into this process.

This PR adds progress logging by wiring the existing ReadEvalLogsProgress infrastructure, which was not previously used by eval_set(), into the resume scan.

Changes

  • Added EvalSetScanProgress class that implements ReadEvalLogsProgress
  • Wired progress logging into the try_eval() function in eval_set()
  • Logs at reasonable intervals (every 100 files or 10%) to avoid spam
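The throttling described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the real ReadEvalLogsProgress interface in Inspect AI may differ, and the `update()` method name and return value are assumptions made here.

```python
import logging

logger = logging.getLogger(__name__)


class EvalSetScanProgress:
    """Illustrative progress reporter for the resume scan.

    Emits a progress line every `interval` files, where the interval is
    100 files or 10% of the total (whichever is larger), plus a final
    line when the last file is read.
    """

    def __init__(self, total: int, log_dir: str) -> None:
        self.total = total
        self.read = 0
        self.interval = max(100, total // 10)
        logger.info(
            "Found %s eval log files in %s, reading headers...",
            f"{total:,}", log_dir,
        )

    def update(self, n: int = 1) -> bool:
        """Record n files read; return True if a progress line was emitted."""
        self.read += n
        emit = self.read % self.interval == 0 or self.read >= self.total
        if emit:
            logger.info("Reading eval logs: %d/%d", self.read, self.total)
        return emit
```

For the 250-file test below, the interval works out to max(100, 25) = 100, so progress lines appear at 100, 200, and 250.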

Log Output Example

INFO: Found 1,234 eval log files in s3://bucket/evals/abc123, reading headers...
INFO: Reading eval logs: 100/1234
INFO: Reading eval logs: 200/1234
...
INFO: Resume scan complete: 1,234 logs read in 45.3s (1,100 completed, 134 pending)

Testing

  • Unit tests for EvalSetScanProgress class (7 tests)
  • All existing test_eval_set.py tests pass
  • Type checking passes (pyright)

🤖 Generated with Claude Code

@QuantumLove
Author

Testing Summary

  How I Tested It

  1. Created 250 valid .json eval log files using Inspect AI's own write_eval_log() API to ensure they're in the correct format that list_eval_logs() can parse.
  2. Ran eval_set() with:
    - log_dir pointing to the directory with existing logs
    - log_dir_allow_dirty=True to allow unrelated logs
    - log_level="info" to show INFO-level messages
    - display="none" to see raw logs instead of Rich UI
  3. The eval_set resume flow then:
    - Scanned the directory for existing .json log files
    - Read headers from all 250 files (triggering progress logging)
    - Determined 0 completed tasks, 1 pending task
    - Ran the new task

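The resume-scan flow in the steps above can be simulated with a small self-contained sketch. This is not the PR's code: the `scan_headers` helper and the `status` field are assumptions for illustration, and "pending" here counts incomplete logs, whereas eval_set() itself reports pending tasks.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="INFO     %(message)s")
logger = logging.getLogger("evalset")


def scan_headers(headers: list[dict]) -> str:
    """Read pre-parsed log headers, logging throttled progress, and
    return a summary line in the same shape as the test output."""
    start = time.monotonic()
    total = len(headers)
    # Every 100 files or 10% of the total, whichever is larger.
    interval = max(100, total // 10)
    for i, _header in enumerate(headers, start=1):
        if i % interval == 0 or i == total:
            logger.info("Reading eval logs: %d/%d", i, total)
    completed = sum(1 for h in headers if h.get("status") == "success")
    pending = total - completed
    elapsed = time.monotonic() - start
    return (
        f"Resume scan complete: {total:,} logs read in "
        f"{elapsed:.1f}s ({completed:,} completed, {pending:,} pending)"
    )
```

Driving it with 250 incomplete headers produces progress lines at 100, 200, and 250, followed by the summary string.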
  Successful Test Output

  ======================================================================
  Running eval_set() - watch for progress logging below:
  (Look for 'Found X eval log files' and 'Reading eval logs: X/Y')
  ======================================================================

  [02/26/26 13:18:00] INFO     Found 250 eval log files in           evalset.py:96
                               /var/folders/x2/jkcxl7g52ld48p71lkmpc
                               zy00000gn/T/eval_set_progress_test_k7
                               m75fvx, reading headers...
                      INFO     Reading eval logs: 100/250           evalset.py:110
                      INFO     Reading eval logs: 200/250           evalset.py:110
                      INFO     Reading eval logs: 250/250           evalset.py:110
                      INFO     Resume scan complete: 250 logs read  evalset.py:117
                               in 0.1s (0 completed, 1 pending)

  eval_set completed: success=True, logs=1

  What Each Log Line Shows

  ┌────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────┐
  │                                  Log Message                                   │                                Purpose                                 │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Found 250 eval log files in .../eval_set_progress_test_..., reading headers... │ Initial scan discovered 250 files                                      │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 100/250                                                     │ Progress at 100 files (40%)                                            │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 200/250                                                     │ Progress at 200 files (80%)                                            │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Reading eval logs: 250/250                                                     │ Final progress at 250 files (100%)                                     │
  ├────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────┤
  │ Resume scan complete: 250 logs read in 0.1s (0 completed, 1 pending)           │ Summary: 0.1s scan time, 0 matching completed tasks, 1 new task to run │
  └────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

@QuantumLove QuantumLove force-pushed the feat/eval-set-resume-progress-logging branch 2 times, most recently from f9f7349 to 5213d03 Compare February 26, 2026 13:06
When eval_set() resumes and scans S3 for existing logs, this can take
0-90 minutes for large eval sets with no visibility. This adds progress
logging using the existing ReadEvalLogsProgress infrastructure.

Changes:
- Add EvalSetScanProgress class implementing ReadEvalLogsProgress
- Wire progress logging into try_eval() function
- Log at reasonable intervals (every 100 files or 10%, whichever is larger)
- Add @override decorators for type safety (consistent with codebase)
- Add comprehensive unit tests

Log output example:
  INFO: Found 1,234 eval log files in s3://..., reading headers...
  INFO: Reading eval logs: 100/1234
  ...
  INFO: Resume scan complete: 1,234 logs read in 45.3s (1,100 completed, 134 pending)

Co-Authored-By: Claude <[email protected]>
@QuantumLove QuantumLove force-pushed the feat/eval-set-resume-progress-logging branch from 5213d03 to c91877f Compare February 26, 2026 13:11
@QuantumLove QuantumLove marked this pull request as ready for review February 26, 2026 13:12
@jjallaire
Collaborator

We just merged a PR (#3300) that should massively improve the performance here, possibly to the point where we don't need progress. @ransomr what do you think?

@ransomr
Contributor

ransomr commented Feb 26, 2026

There are two things that are potentially slow.

  1. Scanning for the logs. "Async parallel reading of eval log headers" (#3189) sped this up dramatically.
  2. Reading samples from incomplete logs. "perf: share AsyncFilesystem via ContextVar across async context" (#3300) sped this up.

The progress here is only for step 1 - @QuantumLove are you still seeing that part be slow enough that log progress is needed? In my testing it was down to ~12s for a 600 log data set. The output above shows 45s for 1200 log files. If you are still seeing this take many minutes I'd be interested in understanding the scenario more.

I would expect #2 to still be the slower part of the process, though reusing partial logs may not happen often enough for it to matter.

If the evalset setup is still slow enough that we generally need progress, I think we should probably update the evalset textual UI to show it.

@QuantumLove
Author

Hey @jjallaire and @ransomr, thank you for the quick reply, and happy to hear about the performance enhancements.

We will upgrade to the latest version and report back what we see. Stay tuned!

@jjallaire
Collaborator

Note that eval set actually has a phase where there is no textual UI (which only occurs during actual task execution) so it would be completely fine to do this progress at the terminal if we do in fact need it.
