Enable multi-version evaluation and checkpoint replay #567

@joellidin

Current Limitations

The evaluator currently:

  • Only evaluates the latest checkpoint discovered
  • Cannot replay/re-evaluate historical checkpoints
  • Doesn't support evaluating checkpoints from multiple model versions
  • Tracks evaluated windows but cannot systematically process all available checkpoints

Proposed Enhancement

Implement comprehensive checkpoint evaluation capabilities:

  1. Multi-version support: Allow the evaluator to work with checkpoints from different model versions simultaneously
  2. Full replay mode: Add ability to systematically evaluate all historical checkpoints, not just the newest ones
  3. Batch evaluation: Process multiple checkpoints in sequence for comprehensive benchmarking
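
A minimal sketch of how replay planning could work, assuming checkpoints are identified by a (version, window) pair and that the evaluator already knows which pairs it has processed. The names `plan_replay` and `version_sort_key` are hypothetical and not part of the existing evaluator.

```python
from typing import Iterable, List, Optional, Tuple

Checkpoint = Tuple[str, int]  # (version, window); an assumed identifier shape


def version_sort_key(version: str) -> tuple:
    """Order 'v2.0.9' before 'v2.0.10' by comparing numeric parts as integers."""
    parts = version.lstrip("v").split(".")
    # Numeric parts sort before non-numeric ones so mixed tags never raise.
    return tuple((0, int(p)) if p.isdigit() else (1, p) for p in parts)


def plan_replay(
    available: Iterable[Checkpoint],
    evaluated: Iterable[Checkpoint],
    versions: Optional[Iterable[str]] = None,
) -> List[Checkpoint]:
    """Return every not-yet-evaluated checkpoint, oldest version and window first.

    `versions` optionally restricts the plan to selected model versions.
    """
    done = set(evaluated)
    wanted = None if versions is None else set(versions)
    pending = [
        ckpt
        for ckpt in set(available)
        if ckpt not in done and (wanted is None or ckpt[0] in wanted)
    ]
    return sorted(pending, key=lambda c: (version_sort_key(c[0]), c[1]))
```

Calling `plan_replay(available, evaluated)` would give the full replay order, while passing `versions=["v2.1.0"]` would limit the batch to a single model version.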

Checkpoint Storage Reorganization

To support this enhancement, we should migrate to a dedicated checkpoint storage architecture:

  • New bucket structure: Move checkpoints from the current validator bucket to a dedicated checkpoint bucket
  • Organized format: Use a hierarchical structure: <version>/<window>/data
    • Example: v2.0.9/150/model.safetensors
    • Enables efficient version-based queries
    • Simplifies checkpoint discovery and management
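
To illustrate the version-based queries this layout allows, here is a rough sketch that parses object keys of the form <version>/<window>/<file> and groups windows by version. The helper names (`parse_key`, `windows_by_version`) are hypothetical, and the key list is assumed to come from whatever bucket listing call the storage client provides.

```python
from __future__ import annotations

import re
from collections import defaultdict
from typing import Iterable, NamedTuple


class CheckpointKey(NamedTuple):
    version: str
    window: int
    filename: str


# Matches keys such as "v2.0.9/150/model.safetensors".
_KEY_RE = re.compile(r"^(?P<version>[^/]+)/(?P<window>\d+)/(?P<filename>[^/]+)$")


def parse_key(key: str) -> CheckpointKey | None:
    """Split a bucket key into its parts, or return None if it does not fit the layout."""
    m = _KEY_RE.match(key)
    if m is None:
        return None
    return CheckpointKey(m["version"], int(m["window"]), m["filename"])


def windows_by_version(keys: Iterable[str]) -> dict[str, list[int]]:
    """Group the windows that have checkpoints under each version, sorted ascending."""
    grouped: dict[str, set[int]] = defaultdict(set)
    for key in keys:
        parsed = parse_key(key)
        if parsed is not None:
            grouped[parsed.version].add(parsed.window)
    return {version: sorted(windows) for version, windows in grouped.items()}
```

Because the version is the first path segment, a prefix listing on `v2.0.9/` would be enough to discover every checkpoint for that version without scanning the whole bucket.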

Migration Strategy

Create a migration script that:

  • Copies all existing checkpoints from the current validator bucket
  • Reorganizes them into the new <version>/<window>/data structure
  • Preserves metadata and checkpoint integrity
  • Provides verification of successful migration
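
The copy-and-verify step could look roughly like the sketch below. It uses plain dictionaries as stand-ins for the two buckets, and it leaves the mapping from old keys to new keys to a caller-supplied function, since the current validator bucket layout is not described here. `migrate` and `new_key_for` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def migrate(
    source: Dict[str, bytes],
    dest: Dict[str, bytes],
    new_key_for: Callable[[str], Optional[str]],
) -> List[str]:
    """Copy checkpoints from `source` to `dest` under their new keys.

    `new_key_for` maps an old key to "<version>/<window>/<file>", or returns
    None for objects that are not checkpoints. The source is never modified.
    Returns the old keys whose copied bytes failed the checksum comparison.
    """
    failures: List[str] = []
    for old_key, payload in source.items():
        new_key = new_key_for(old_key)
        if new_key is None:
            continue  # not a checkpoint; leave it where it is
        dest[new_key] = payload
        # Read back what was written and compare digests to verify the copy.
        if sha256(dest[new_key]) != sha256(payload):
            failures.append(old_key)
    return failures
```

Copying rather than moving keeps the validator bucket intact until the returned failure list is empty, which also helps with backward compatibility during the transition.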

Benefits

  • Complete evaluation history for model performance tracking
  • Ability to compare performance across versions
  • Better checkpoint organization and discoverability
  • Enables retroactive evaluation with updated metrics
  • Supports A/B testing between model versions

Implementation Considerations

  • Maintain backward compatibility during transition
  • Implement efficient checkpoint discovery across versions
  • Add configuration for selective version evaluation
  • Consider checkpoint deduplication strategies
  • Implement progress tracking for batch evaluations
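
As one possible shape for progress tracking, the sketch below records completed (version, window) pairs in a local JSON file so an interrupted batch can resume where it stopped. `ReplayProgress`, `run_batch`, and the `evaluate` callback are all hypothetical. The real evaluator may prefer to store this state elsewhere.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Set, Tuple

Checkpoint = Tuple[str, int]  # (version, window)


class ReplayProgress:
    """Persists the set of evaluated checkpoints to a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.done: Set[Checkpoint] = set()
        if self.path.exists():
            # JSON stores each pair as a list, so convert back to tuples.
            self.done = {tuple(item) for item in json.loads(self.path.read_text())}

    def is_done(self, version: str, window: int) -> bool:
        return (version, window) in self.done

    def mark_done(self, version: str, window: int) -> None:
        self.done.add((version, window))
        self.path.write_text(json.dumps(sorted(self.done)))


def run_batch(
    checkpoints: Iterable[Checkpoint],
    evaluate: Callable[[str, int], None],
    progress: ReplayProgress,
) -> int:
    """Evaluate each checkpoint once, skipping finished ones. Returns the number evaluated."""
    ran = 0
    for version, window in checkpoints:
        if progress.is_done(version, window):
            continue
        evaluate(version, window)
        # Record completion only after evaluation succeeds.
        progress.mark_done(version, window)
        ran += 1
    return ran
```

Writing the file after every checkpoint trades a little I/O for the guarantee that a crash loses at most the evaluation in flight.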
