
eval-card-registry

Entity resolution registry for AI evaluation data. Maps raw model, benchmark, metric, and harness names from the EEE datastore to stable canonical IDs, and stores resolved evaluation results in a flat mapping table (eval_results).


Quickstart

Resolve a raw string against the hosted registry:

curl -X POST https://evaleval-entity-registry.hf.space/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "MATH Level 5", "entity_type": "benchmark"}'
{
  "canonical_id": "math-level-5",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": "math"
}

(Model-only fields like parents, root_model_id, lineage_origin_org_id, open_weights, release_date, params_billions are present too but null for benchmarks/metrics/harnesses — see API section for the full response shape.)

entity_type is one of benchmark, model, metric, harness, org. See the API section for batch resolve, entity browsing, and the full endpoint list.


Local development

git clone <repo>
cd eval-card-registry
uv sync
cp .env.example .env          # defaults work for local dev

1. Seed the registry with known entities:

uv run eval-card-registry seed --local

This loads benchmarks, metrics, and harnesses from seed/*.yaml into fixtures/*.parquet. You should see counts printed for each entity type. Note that these are automatically generated placeholders for internal development and are likely to change.

2. Check what's in the registry:

uv run eval-card-registry stats --local

Expected output:

  models      total=0  draft=0
  benchmarks  total=61  draft=0
  metrics     total=22  draft=0
  harnesses   total=11  draft=0

  aliases        total=403  uncertain=0
  eval_results   total=0
  resolution_log total=0
  sync_runs      total=0

3. Sync an EEE config — resolve entities and populate the mapping table:

uv run eval-card-registry sync --config hfopenllm_v2 --local

This downloads the EEE dataset config from HuggingFace (first run will take a few minutes), resolves every raw string to a canonical entity, and writes results to fixtures/eval_results.parquet — the mapping table (one row per model × benchmark × metric result).

4. Verify results:

uv run eval-card-registry stats --local

You should now see eval_results, aliases, and entity counts populated. Each row in eval_results looks like:

{
  "evaluation_id": "hfopenllm_v2/...",
  "result_index": 0,
  "source_config": "hfopenllm_v2",
  "model_id": "meta-llama/Llama-3.1-8B",
  "harness_id": "lm-evaluation-harness",
  "benchmark_id": "ifeval",
  "parent_benchmark_id": null,
  "metric_id": "accuracy",
  "benchmark_card_id": null,
  "score": 0.42,
  "score_details": "{\"score\": 0.42}"
}
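Once populated, the flat mapping table is easy to analyze offline. A minimal sketch in plain Python (the rows below are hypothetical, shaped like the example above) that averages scores per model:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows shaped like the eval_results example above.
rows = [
    {"model_id": "meta-llama/Llama-3.1-8B", "benchmark_id": "ifeval", "score": 0.42},
    {"model_id": "meta-llama/Llama-3.1-8B", "benchmark_id": "math-level-5", "score": 0.30},
]

by_model = defaultdict(list)
for row in rows:
    by_model[row["model_id"]].append(row["score"])

averages = {model: mean(scores) for model, scores in by_model.items()}
print(averages)  # {'meta-llama/Llama-3.1-8B': 0.36}
```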

How it works

Raw strings from EEE (e.g. "MATH Level 5", "lm-evaluation-harness") are resolved to canonical IDs (math, lm-evaluation-harness) through a strategy chain: exact alias match → normalized match (collapses case + all separators — spaces, hyphens, underscores, and slashes) → fuzzy stem match (strips known suffixes like -fc/-prompt, normalizes org prefixes) → auto-create draft. Every resolution is logged with its strategy and confidence score.
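The normalized-match step can be pictured as a simple key function (an illustrative sketch, not the resolver's actual implementation):

```python
import re

def normalize(raw: str) -> str:
    # Collapse case and all separator characters (spaces, hyphens,
    # underscores, slashes) into a single comparison key.
    return re.sub(r"[ \-_/]+", "", raw.lower())

# Different raw spellings collapse to the same key:
print(normalize("MATH Level 5"))   # mathlevel5
print(normalize("math_level-5"))   # mathlevel5
```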

Canonical entities start as draft and can be promoted to reviewed. Aliases that fall below the confidence threshold are flagged uncertain for human review.


Project layout

eval-card-registry/
├── packages/eval-entity-resolver/        # Standalone resolver package (uv workspace member)
├── src/eval_card_registry/               # FastAPI service + CLI
│   ├── api/                              # Route handlers
│   ├── services/                         # resolution_service, ingestion pipeline
│   └── store/                            # In-memory store backed by HF Dataset parquet
├── seed/                                 # Known canonical entities (YAML)
│   ├── orgs.yaml                         # Curated orgs (kind: lab)
│   ├── benchmarks.yaml / metrics.yaml / harnesses.yaml
│   └── models/                           # Three-layer model seed
│       ├── core.yaml                     # Hand-curated, source of truth
│       ├── sources/*.generated.yaml      # Bulk imports (e.g. models.dev)
│       └── enrichments/aliases.yaml      # Alias-only adds, union onto existing
├── scripts/                              # One-shot tools and refresh scripts
├── fixtures/                             # Local parquet files for offline dev/tests
└── tests/

The service and the resolver package are separate. The resolver can be imported directly by other pipelines (e.g. AutoBenchmarkCard) without pulling in the full service.


CLI reference

All commands require the uv run prefix (or install the package first with uv pip install -e .).

# Seed known entities from seed/ YAML files
uv run eval-card-registry seed --local

# Print entity counts, draft counts, uncertain aliases
uv run eval-card-registry stats --local

# Sync one EEE config — resolves entities, writes to eval_results table
uv run eval-card-registry sync --config hfopenllm_v2 --local

# Sync all configs
uv run eval-card-registry sync --all --local

# Re-resolve everything (after updating seed data or fuzzy matching logic)
uv run eval-card-registry sync --config hfopenllm_v2 --rerun --local

Drop --local and configure .env with HF credentials to read from and write to the HF Hub instead of fixtures/.


API

Start the server:

LOCAL_MODE=true uv run uvicorn eval_card_registry.main:app --reload

Base path: http://localhost:8000/api/v1

Resolve a raw string:

curl -X POST http://localhost:8000/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "MATH Level 5", "entity_type": "benchmark", "source_config": "hfopenllm_v2"}'
{
  "canonical_id": "math-level-5",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": "math",
  "resolved_leaf_id": null,
  "root_model_id": null,
  "lineage_origin_org_id": null,
  "parents": null,
  "open_weights": null,
  "release_date": null,
  "params_billions": null
}

For models, the model-only fields are populated:

curl -X POST http://localhost:8000/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "meta/llama-3.1-8b-instruct-turbo", "entity_type": "model"}'
{
  "canonical_id": "meta/llama-3.1-8b-instruct",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": null,
  "resolved_leaf_id": "meta/llama-3.1-8b-instruct-turbo",
  "root_model_id": "meta/llama-3.1-8b-instruct",
  "lineage_origin_org_id": "meta",
  "parents": [{"id": "meta/llama-3.1-8b-instruct", "relationship": "quantized"}],
  "open_weights": true,
  "release_date": "2024-07-18",
  "params_billions": 8.0
}

canonical_id is the identity root — for quantized chains it collapses to the unquantized base; the actual matched leaf is in resolved_leaf_id. For finetune/merge/adapter relationships, the leaf IS its own identity (no collapse). All metadata fields (open_weights, release_date, params_billions) come from the canonical_id row, so they describe the same entity the response identifies. See CLAUDE.md "Typed parents on canonical_models" for the full schema.
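The collapse rule can be sketched as a short walk over typed parent edges (illustrative only — the parents_by_id structure and edge shape here are assumptions based on the response format above):

```python
def collapse_to_identity_root(leaf_id: str, parents_by_id: dict) -> str:
    # Follow "quantized" edges upward; finetune/merge/adapter edges do not
    # collapse, because those leaves are their own identity.
    current = leaf_id
    while True:
        quantized = [e for e in parents_by_id.get(current, [])
                     if e["relationship"] == "quantized"]
        if not quantized:
            return current
        current = quantized[0]["id"]

parents_by_id = {
    "meta/llama-3.1-8b-instruct-turbo": [
        {"id": "meta/llama-3.1-8b-instruct", "relationship": "quantized"}
    ],
}
print(collapse_to_identity_root("meta/llama-3.1-8b-instruct-turbo", parents_by_id))
# meta/llama-3.1-8b-instruct
```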

Batch resolve:

POST /api/v1/resolve/batch
Body: [{ "raw_value": "...", "entity_type": "..." }, ...]

Entity CRUD (models, benchmarks, metrics, harnesses):

GET    /api/v1/benchmarks?search=math&review_status=draft
GET    /api/v1/benchmarks/{id}
POST   /api/v1/benchmarks
PATCH  /api/v1/benchmarks/{id}

Model IDs containing / (e.g. meta-llama/Llama-3.1-8B) work in path params directly.

Aliases:

GET    /api/v1/aliases?status=uncertain&entity_type=benchmark
PATCH  /api/v1/aliases/{id}   # confirm, reject, or correct an alias

Health and stats:

GET  /api/v1/health
GET  /api/v1/stats

Interactive docs at http://localhost:8000/docs.


Using the resolver standalone

The eval-entity-resolver package can be used independently — no service required — and it returns the same rich response shape as the HTTP API (root-collapse for quantized chains, parents, open_weights, release_date, params_billions, etc.):

from eval_entity_resolver import Resolver, ResolverConfig

# Load both aliases AND canonical entities from the production HF Dataset:
resolver = Resolver.from_hf("evaleval/entity-registry-data",
                            config=ResolverConfig(threshold=0.85))

# Or from a local parquet directory (e.g. after `eval-card-registry seed --local`):
resolver = Resolver.from_parquet("./fixtures/")

result = resolver.resolve(
    raw_value="meta/llama-3.1-8b-instruct-turbo",
    entity_type="model",           # one of: model | benchmark | metric | harness | org
    source_config=None,             # optional; scopes to per-config aliases
)
# result is a `ResolutionResult` dataclass mirroring the HTTP API:
#   raw_value, entity_type, source_config — echo of inputs
#   canonical_id          — identity root for quantized chains; None on no_match
#   strategy, confidence  — match info
#   review_status         — "draft" | "reviewed"
#   parent_canonical_id   — family/variant parent
#   resolved_leaf_id      — original match before root-collapse (models only)
#   root_model_id         — quantized-chain root (None when self IS the root)
#   lineage_origin_org_id — upstream lab for finetunes / quants
#   parents               — full typed-edge list
#   open_weights          — bool / None
#   release_date          — YYYY-MM-DD / None
#   params_billions       — float / None

For example, resolving meta/llama-3.1-8b-instruct-turbo collapses to the unquantized base canonical; the original turbo id is preserved in resolved_leaf_id:

>>> r = resolver.resolve("meta/llama-3.1-8b-instruct-turbo", "model")
>>> r.canonical_id, r.resolved_leaf_id, r.open_weights
('meta/llama-3.1-8b-instruct', 'meta/llama-3.1-8b-instruct-turbo', True)

If you really want the bare matcher (no metadata enrichment), you can construct Resolver without a CanonicalStore:

from eval_entity_resolver import AliasStore, Resolver
resolver = Resolver(AliasStore.from_parquet("./fixtures/"))  # no canonical_store
# Now `result` only has `canonical_id`, `strategy`, `confidence`. All other
# fields are None. Useful when you don't have the canonical_models parquet
# (e.g. an alias-only HF dataset) or just want to avoid the lookup cost.

Install from this workspace:

uv add eval-entity-resolver --workspace

Tests

uv run pytest

Tests use the in-memory fixture store — no HF credentials or network needed.


Resolution behaviour

Alias status   Meaning
auto           Resolved above confidence threshold — no review needed
uncertain      Below threshold or no match — auto-created draft, flagged for review
confirmed      Manually verified
rejected       Wrong match, excluded from future resolution

Resolution order for a given (entity_type, raw_value):

  1. Config-scoped alias (source_config matches)
  2. Global alias (source_config is null)
  3. Resolver chain (exact → normalized → fuzzy → auto-create draft)

Resolving the same raw string twice returns the same canonical ID. Re-running with --rerun re-evaluates existing aliases — prior resolution log entries are preserved.
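The lookup order can be sketched as two dictionary probes before falling through to the resolver chain (illustrative; the alias-store shape and names here are hypothetical):

```python
def lookup(entity_type, raw_value, source_config, aliases, resolver_chain):
    # 1. Config-scoped alias (source_config matches)
    hit = aliases.get((entity_type, raw_value, source_config))
    # 2. Global alias (source_config is null)
    if hit is None:
        hit = aliases.get((entity_type, raw_value, None))
    # 3. Resolver chain (exact -> normalized -> fuzzy -> auto-create draft)
    if hit is None:
        hit = resolver_chain(entity_type, raw_value)
    return hit

aliases = {("benchmark", "MATH Level 5", None): "math-level-5"}
result = lookup("benchmark", "MATH Level 5", "hfopenllm_v2",
                aliases, lambda t, v: None)
print(result)  # math-level-5 (global alias, no config-scoped match)
```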


ID conventions

Entity                         Format                                    Example
Model                          {org_id}/{model-slug}                     meta/llama-3.1-8b, anthropic/claude-opus-4.5
Benchmark / Metric / Harness   lowercase slug                            math, lm-evaluation-harness
eval_results row ID            sha256(evaluation_id:result_index)[:16]   a3f2b1c9d4e5f678

Entity IDs use human-readable slugs (not hashes) because they appear in seed files, API responses, and are referenced during manual curation. Internal row IDs (like eval_results.id) use deterministic hashes for uniform length and collision resistance.
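A sketch of the row-ID scheme (assuming the two parts are joined with a literal colon before hashing, as the format string suggests; the evaluation_id below is hypothetical):

```python
import hashlib

def eval_result_row_id(evaluation_id: str, result_index: int) -> str:
    # sha256 over "evaluation_id:result_index", truncated to 16 hex chars
    # for uniform length and collision resistance.
    payload = f"{evaluation_id}:{result_index}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

row_id = eval_result_row_id("hfopenllm_v2/some-eval", 0)
print(row_id, len(row_id))  # deterministic 16-hex-char ID
```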


HF Hub deployment

For production, configure .env:

LOCAL_MODE=false
HF_TOKEN=hf_...
HF_DATASET_REPO=org/eval-card-registry

Then run the same commands without --local:

uv run eval-card-registry seed
uv run eval-card-registry sync --config hfopenllm_v2

Data is stored as one parquet config per table in the HF Dataset repo.


HF Space deployment (query-only API)

The service can be deployed to a HuggingFace Space as a query-only disambiguation API — read-only resolve + entity/alias GETs, no writes.

Architecture:

  • Space (evaleval/entity-registry) — Docker SDK, runs FastAPI on port 7860
  • Dataset repo (evaleval/entity-registry-data) — entity parquet tables, read at startup
  • Storage Bucket (evaleval/entity-registry-storage) — async resolve logs, written periodically

Read-only mode behaviour:

  • POST /resolve runs the full resolver chain but does NOT auto-create draft entities or write aliases on no_match — canonical_id is null
  • POST/PATCH entity + alias endpoints return 405 Method Not Allowed
  • Only 5 tables (models, benchmarks, metrics, harnesses, aliases) are loaded — eval_results, resolution_log, sync_runs are skipped
  • Every resolve request is logged asynchronously to the Storage Bucket (buffered in memory, flushed every 5 min as partitioned parquet)

Deploy:

# Prerequisites: create the Space, Dataset repo, and Storage Bucket on HF;
# seed + sync the Dataset repo with entity data locally first.

bash deploy/push-to-space.sh

Configure the Space in HF Space Settings:

Variable          Type       Value
HF_TOKEN          Secret     Token with read access to dataset + write access to log bucket
HF_DATASET_REPO   Variable   evaleval/entity-registry-data
HF_LOG_BUCKET     Variable   evaleval/entity-registry-storage

READ_ONLY=true and LOCAL_MODE=false are set in the Dockerfile ENV.

Local test of read-only mode:

READ_ONLY=true LOCAL_MODE=true uv run uvicorn eval_card_registry.main:app --reload

See deploy/END_TO_END.md for a step-by-step verification guide (local smoke test, Docker test, Space deploy + checks).


TODO

  • Combine logic with EEE codebase's model registry and evalcard backend metric registry
  • Verify metric extraction logic — likely to be partially addressed by future schema versions and fixes.
  • Clean up how we implement registry updates + check against regression
  • Populate benchmark_card_id once an auto-benchmarkcard has been generated and linked for each benchmark.
  • Implement walking and backfilling for lineage
  • Clarify entity type
