Entity resolution registry for AI evaluation data. Maps raw model, benchmark, metric, and harness names from the EEE datastore to stable canonical IDs, and stores resolved evaluation results in a flat mapping table (eval_results).
Resolve a raw string against the hosted registry:
```shell
curl -X POST https://evaleval-entity-registry.hf.space/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "MATH Level 5", "entity_type": "benchmark"}'
```

```json
{
  "canonical_id": "math-level-5",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": "math"
}
```

(Model-only fields like `parents`, `root_model_id`, `lineage_origin_org_id`, `open_weights`, `release_date`, and `params_billions` are present too but null for benchmarks/metrics/harnesses — see the API section for the full response shape.)
`entity_type` is one of `benchmark`, `model`, `metric`, `harness`, `org`. See the API section for batch resolve, entity browsing, and the full endpoint list.
```shell
git clone <repo>
cd eval-card-registry
uv sync
cp .env.example .env  # defaults work for local dev
```

1. Seed the registry with known entities:

```shell
uv run eval-card-registry seed --local
```

This loads benchmarks, metrics, and harnesses from `seed/*.yaml` into `fixtures/*.parquet`. You should see counts printed for each entity type. Note that these are automatically generated placeholders for internal development and will likely change in the future.
2. Check what's in the registry:

```shell
uv run eval-card-registry stats --local
```

Expected output:

```
models          total=0   draft=0
benchmarks      total=61  draft=0
metrics         total=22  draft=0
harnesses       total=11  draft=0
aliases         total=403 uncertain=0
eval_results    total=0
resolution_log  total=0
sync_runs       total=0
```
3. Sync an EEE config — resolve entities and populate the mapping table:

```shell
uv run eval-card-registry sync --config hfopenllm_v2 --local
```

This downloads the EEE dataset config from HuggingFace (the first run will take a few minutes), resolves every raw string to a canonical entity, and writes results to `fixtures/eval_results.parquet` — the mapping table (one row per model × benchmark × metric result).
4. Verify results:

```shell
uv run eval-card-registry stats --local
```

You should now see `eval_results`, aliases, and entity counts populated. Each row in `eval_results` looks like:

```json
{
  "evaluation_id": "hfopenllm_v2/...",
  "result_index": 0,
  "source_config": "hfopenllm_v2",
  "model_id": "meta-llama/Llama-3.1-8B",
  "harness_id": "lm-evaluation-harness",
  "benchmark_id": "ifeval",
  "parent_benchmark_id": null,
  "metric_id": "accuracy",
  "benchmark_card_id": null,
  "score": 0.42,
  "score_details": "{\"score\": 0.42}"
}
```

Raw strings from EEE (e.g. "MATH Level 5", "lm-evaluation-harness") are resolved to canonical IDs (`math`, `lm-evaluation-harness`) through a strategy chain: exact alias match → normalized match (collapses case and all separators — spaces, hyphens, underscores, and slashes) → fuzzy stem match (strips known suffixes like `-fc`/`-prompt`, normalizes org prefixes) → auto-create draft. Every resolution is logged with its strategy and confidence score.
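For intuition, the normalized-match step of the chain can be sketched in a few lines. This is an illustrative approximation, not the resolver's actual implementation, and the `normalize` helper name is hypothetical:

```python
import re

def normalize(raw: str) -> str:
    # Collapse case plus spaces, hyphens, underscores, and slashes so that
    # variants like "MATH Level 5" and "math-level-5" compare equal.
    # Illustrative sketch; the real resolver's normalization may differ.
    return re.sub(r"[ \-_/]+", "", raw).lower()

print(normalize("MATH Level 5"))   # mathlevel5
print(normalize("math-level-5"))   # mathlevel5
```

Anything the normalized match misses falls through to the fuzzy stem match and, failing that, to auto-creation of a draft entity.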
Canonical entities start as `draft` and can be promoted to `reviewed`. Aliases that fall below the confidence threshold are flagged `uncertain` for human review.
```
eval-card-registry/
├── packages/eval-entity-resolver/     # Standalone resolver package (uv workspace member)
├── src/eval_card_registry/            # FastAPI service + CLI
│   ├── api/                           # Route handlers
│   ├── services/                      # resolution_service, ingestion pipeline
│   └── store/                         # In-memory store backed by HF Dataset parquet
├── seed/                              # Known canonical entities (YAML)
│   ├── orgs.yaml                      # Curated orgs (kind: lab)
│   ├── benchmarks.yaml / metrics.yaml / harnesses.yaml
│   └── models/                        # Three-layer model seed
│       ├── core.yaml                  # Hand-curated, source of truth
│       ├── sources/*.generated.yaml   # Bulk imports (e.g. models.dev)
│       └── enrichments/aliases.yaml   # Alias-only adds, union onto existing
├── scripts/                           # One-shot tools and refresh scripts
├── fixtures/                          # Local parquet files for offline dev/tests
└── tests/
```
The service and the resolver package are separate. The resolver can be imported directly by other pipelines (e.g. AutoBenchmarkCard) without pulling in the full service.
All commands require the `uv run` prefix (or install the package first with `uv pip install -e .`).
```shell
# Seed known entities from seed/ YAML files
uv run eval-card-registry seed --local

# Print entity counts, draft counts, uncertain aliases
uv run eval-card-registry stats --local

# Sync one EEE config — resolves entities, writes to eval_results table
uv run eval-card-registry sync --config hfopenllm_v2 --local

# Sync all configs
uv run eval-card-registry sync --all --local

# Re-resolve everything (after updating seed data or fuzzy matching logic)
uv run eval-card-registry sync --config hfopenllm_v2 --rerun --local
```

Drop `--local` and configure `.env` with HF credentials to read/write from the HF Hub instead of `fixtures/`.
Start the server:

```shell
LOCAL_MODE=true uv run uvicorn eval_card_registry.main:app --reload
```

Base path: `http://localhost:8000/api/v1`
Resolve a raw string:
```shell
curl -X POST http://localhost:8000/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "MATH Level 5", "entity_type": "benchmark", "source_config": "hfopenllm_v2"}'
```

```json
{
  "canonical_id": "math-level-5",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": "math",
  "resolved_leaf_id": null,
  "root_model_id": null,
  "lineage_origin_org_id": null,
  "parents": null,
  "open_weights": null,
  "release_date": null,
  "params_billions": null
}
```

For models, the model-only fields are populated:
```shell
curl -X POST http://localhost:8000/api/v1/resolve \
  -H 'Content-Type: application/json' \
  -d '{"raw_value": "meta/llama-3.1-8b-instruct-turbo", "entity_type": "model"}'
```

```json
{
  "canonical_id": "meta/llama-3.1-8b-instruct",
  "strategy": "exact",
  "confidence": 1.0,
  "created_new": false,
  "review_status": "reviewed",
  "parent_canonical_id": null,
  "resolved_leaf_id": "meta/llama-3.1-8b-instruct-turbo",
  "root_model_id": "meta/llama-3.1-8b-instruct",
  "lineage_origin_org_id": "meta",
  "parents": [{"id": "meta/llama-3.1-8b-instruct", "relationship": "quantized"}],
  "open_weights": true,
  "release_date": "2024-07-18",
  "params_billions": 8.0
}
```

`canonical_id` is the identity root — for quantized chains it collapses to the unquantized base; the actual matched leaf is in `resolved_leaf_id`. For finetune/merge/adapter relationships, the leaf *is* its own identity (no collapse). All metadata fields (`open_weights`, `release_date`, `params_billions`) come from the `canonical_id` row, so they describe the same entity the response identifies. See CLAUDE.md "Typed parents on canonical_models" for the full schema.
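The collapse rule can be sketched as a single-edge check. This is an illustrative simplification (the real resolver presumably walks full quantized chains and handles more edge cases), and `identity_root` is a hypothetical helper, not part of the package's API:

```python
def identity_root(leaf_id: str, parents: list[dict]) -> str:
    # A quantized edge collapses to its base; finetune / merge / adapter
    # edges do not. Illustrative single-step sketch of the rule above.
    for edge in parents:
        if edge["relationship"] == "quantized":
            return edge["id"]
    return leaf_id  # finetunes/merges/adapters keep their own identity

# The turbo model from the example above collapses to its unquantized base:
parents = [{"id": "meta/llama-3.1-8b-instruct", "relationship": "quantized"}]
print(identity_root("meta/llama-3.1-8b-instruct-turbo", parents))
# → meta/llama-3.1-8b-instruct
```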
Batch resolve:

```
POST /api/v1/resolve/batch
Body: [{ "raw_value": "...", "entity_type": "..." }, ...]
```

Entity CRUD (models, benchmarks, metrics, harnesses):

```
GET   /api/v1/benchmarks?search=math&review_status=draft
GET   /api/v1/benchmarks/{id}
POST  /api/v1/benchmarks
PATCH /api/v1/benchmarks/{id}
```
Model IDs containing `/` (e.g. `meta-llama/Llama-3.1-8B`) work in path params directly.
Aliases:

```
GET   /api/v1/aliases?status=uncertain&entity_type=benchmark
PATCH /api/v1/aliases/{id}   # confirm, reject, or correct an alias
```

Health and stats:

```
GET /api/v1/health
GET /api/v1/stats
```
Interactive docs at http://localhost:8000/docs.
The eval-entity-resolver package can be used independently — no service required — and it returns the same rich response shape as the HTTP API (root-collapse for quantized chains, `parents`, `open_weights`, `release_date`, `params_billions`, etc.):
```python
from eval_entity_resolver import Resolver, ResolverConfig

# Load both aliases AND canonical entities from the production HF Dataset:
resolver = Resolver.from_hf(
    "evaleval/entity-registry-data",
    config=ResolverConfig(threshold=0.85),
)

# Or from a local parquet directory (e.g. after `eval-card-registry seed --local`):
resolver = Resolver.from_parquet("./fixtures/")

result = resolver.resolve(
    raw_value="meta/llama-3.1-8b-instruct-turbo",
    entity_type="model",   # one of: model | benchmark | metric | harness | org
    source_config=None,    # optional; scopes to per-config aliases
)

# result is a `ResolutionResult` dataclass mirroring the HTTP API:
#   raw_value, entity_type, source_config — echo of inputs
#   canonical_id          — identity root for quantized chains; None on no_match
#   strategy, confidence  — match info
#   review_status         — "draft" | "reviewed"
#   parent_canonical_id   — family/variant parent
#   resolved_leaf_id      — original match before root-collapse (models only)
#   root_model_id         — quantized-chain root (None when self IS the root)
#   lineage_origin_org_id — upstream lab for finetunes / quants
#   parents               — full typed-edge list
#   open_weights          — bool / None
#   release_date          — YYYY-MM-DD / None
#   params_billions       — float / None
```

For example, resolving `meta/llama-3.1-8b-instruct-turbo` collapses to the unquantized base canonical; the original turbo id is preserved in `resolved_leaf_id`:
```python
>>> r = resolver.resolve("meta/llama-3.1-8b-instruct-turbo", "model")
>>> r.canonical_id, r.resolved_leaf_id, r.open_weights
('meta/llama-3.1-8b-instruct', 'meta/llama-3.1-8b-instruct-turbo', True)
```

If you really want the bare matcher (no metadata enrichment), you can construct `Resolver` without a `CanonicalStore`:
```python
from eval_entity_resolver import AliasStore, Resolver

resolver = Resolver(AliasStore.from_parquet("./fixtures/"))  # no canonical_store

# Now `result` only has `canonical_id`, `strategy`, `confidence`. All other
# fields are None. Useful when you don't have the canonical_models parquet
# (e.g. an alias-only HF dataset) or just want to avoid the lookup cost.
```

Install from this workspace:

```shell
uv add eval-entity-resolver --workspace
```

Run the tests:

```shell
uv run pytest
```

Tests use the in-memory fixture store — no HF credentials or network needed.
| Alias status | Meaning |
|---|---|
| `auto` | Resolved above confidence threshold — no review needed |
| `uncertain` | Below threshold or no match — auto-created draft, flagged for review |
| `confirmed` | Manually verified |
| `rejected` | Wrong match, excluded from future resolution |
Resolution order for a given `(entity_type, raw_value)`:

1. Config-scoped alias (`source_config` matches)
2. Global alias (`source_config` is null)
3. Resolver chain (exact → normalized → fuzzy → auto-create draft)

Resolving the same raw string twice returns the same canonical ID. Re-running with `--rerun` re-evaluates existing aliases — prior resolution log entries are preserved.
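The alias lookup order above can be sketched as two dictionary probes before the resolver chain runs. The `resolve_alias` helper and the alias entries are hypothetical illustrations, not the service's actual data structures:

```python
def resolve_alias(entity_type, raw_value, source_config, aliases):
    # Config-scoped alias first, then global alias; returning None means
    # fall through to the resolver chain (exact → normalized → fuzzy → draft).
    hit = aliases.get((entity_type, raw_value, source_config))
    if hit is None:
        hit = aliases.get((entity_type, raw_value, None))  # global alias
    return hit

# Hypothetical alias table keyed by (entity_type, raw_value, source_config):
aliases = {
    ("benchmark", "MATH Level 5", None): "math-level-5",        # global
    ("harness", "lm-eval", "hfopenllm_v2"): "lm-evaluation-harness",  # config-scoped
}

print(resolve_alias("benchmark", "MATH Level 5", "hfopenllm_v2", aliases))  # math-level-5 (global fallback)
print(resolve_alias("harness", "lm-eval", "hfopenllm_v2", aliases))         # lm-evaluation-harness
print(resolve_alias("harness", "lm-eval", None, aliases))                   # None → resolver chain
```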
| Entity | Format | Example |
|---|---|---|
| Model | `{org_id}/{model-slug}` | `meta/llama-3.1-8b`, `anthropic/claude-opus-4.5` |
| Benchmark / Metric / Harness | lowercase slug | `math`, `lm-evaluation-harness` |
| `eval_results` row ID | `sha256(evaluation_id:result_index)[:16]` | `a3f2b1c9d4e5f678` |
Entity IDs use human-readable slugs (not hashes) because they appear in seed files, API responses, and are referenced during manual curation. Internal row IDs (like eval_results.id) use deterministic hashes for uniform length and collision resistance.
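The row-ID scheme above can be reproduced in a few lines. The literal `:` separator and UTF-8 encoding are assumptions (the table does not spell them out), and `eval_result_row_id` is a hypothetical helper name:

```python
import hashlib

def eval_result_row_id(evaluation_id: str, result_index: int) -> str:
    # First 16 hex chars of sha256("{evaluation_id}:{result_index}").
    # Separator and encoding are assumptions; the real scheme may differ.
    key = f"{evaluation_id}:{result_index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]

rid = eval_result_row_id("hfopenllm_v2/some-eval", 0)
print(rid, len(rid))  # deterministic 16-character hex id
```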
For production, configure `.env`:

```
LOCAL_MODE=false
HF_TOKEN=hf_...
HF_DATASET_REPO=org/eval-card-registry
```

Then run the same commands without `--local`:

```shell
uv run eval-card-registry seed
uv run eval-card-registry sync --config hfopenllm_v2
```

Data is stored as one parquet config per table in the HF Dataset repo.
The service can be deployed to a HuggingFace Space as a query-only disambiguation API — read-only resolve + entity/alias GETs, no writes.
Architecture:

- Space (`evaleval/entity-registry`) — Docker SDK, runs FastAPI on port 7860
- Dataset repo (`evaleval/entity-registry-data`) — entity parquet tables, read at startup
- Storage Bucket (`evaleval/entity-registry-storage`) — async resolve logs, written periodically
Read-only mode behaviour:

- `POST /resolve` runs the full resolver chain but does NOT auto-create draft entities or write aliases on no_match — `canonical_id` is `null`
- `POST`/`PATCH` entity + alias endpoints return `405 Method Not Allowed`
- Only 5 tables (models, benchmarks, metrics, harnesses, aliases) are loaded — `eval_results`, `resolution_log`, `sync_runs` are skipped
- Every resolve request is logged asynchronously to the Storage Bucket (buffered in memory, flushed every 5 min as partitioned parquet)
Deploy:

```shell
# Prerequisites: create the Space, Dataset repo, and Storage Bucket on HF;
# seed + sync the Dataset repo with entity data locally first.
bash deploy/push-to-space.sh
```

Configure the Space in HF Space Settings:
| Variable | Type | Value |
|---|---|---|
| `HF_TOKEN` | Secret | Token with read access to dataset + write access to log bucket |
| `HF_DATASET_REPO` | Variable | `evaleval/entity-registry-data` |
| `HF_LOG_BUCKET` | Variable | `evaleval/entity-registry-storage` |
`READ_ONLY=true` and `LOCAL_MODE=false` are set in the Dockerfile `ENV`.
Local test of read-only mode:

```shell
READ_ONLY=true LOCAL_MODE=true uv run uvicorn eval_card_registry.main:app --reload
```

See deploy/END_TO_END.md for a step-by-step verification guide (local smoke test, Docker test, Space deploy + checks).
- Combine logic with the EEE codebase's model registry and the evalcard backend metric registry
- Verify metric extraction logic (likely to be at least partially addressed by future schema versions and fixes)
- Clean up how registry updates are implemented and guard against regressions
- Populate `benchmark_card_id` once an auto-benchmarkcard has been generated and linked for each benchmark
- Implement walking and backfilling for lineage
- Clarify entity type