Parametrizing the sampling evals from the CLI #926

clefourrier · 2025-08-18T15:35:21Z

This PR does several things.

Constrain the Metric object creation.
We go from

    f1_score_macro = CorpusLevelMetric(
        metric_name="f1",
        sample_level_fn=GenerativePreparator().prepare,
        category=SamplingMethod.GENERATIVE,
        corpus_level_fn=CorpusLevelF1Score(average="macro").compute,
        higher_is_better=True,
    )

to

    f1_score_macro = CorpusLevelMetric(
        metric_name="f1",
        sample_level_fn=GenerativePreparator(),
        category=SamplingMethod.GENERATIVE,
        corpus_level_fn=CorpusLevelF1Score(average="macro"),
        higher_is_better=True,
    )

sample_level_fn must now be an instance of a class deriving from either SampleLevelComputation or Preparator. The former must implement a compute method, the latter a prepare method.
corpus_level_fn is either a plain function (np.mean and so on) or a CorpusLevelComputation, which must implement a .compute_corpus method.
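
To make the contract above concrete, here is a minimal sketch. It is not the code from this PR: the three base classes are stand-ins that only reuse the names mentioned above, all method signatures are assumptions, and ExactMatch and ToyMetric are hypothetical names made up for the illustration.

```python
from abc import ABC, abstractmethod
from statistics import mean


class SampleLevelComputation(ABC):
    """Stand-in base class: scores a single sample."""

    @abstractmethod
    def compute(self, prediction, gold):  # assumed signature
        ...


class Preparator(ABC):
    """Stand-in base class: reshapes a single sample for corpus-level scoring."""

    @abstractmethod
    def prepare(self, prediction, gold):  # assumed signature
        ...


class CorpusLevelComputation(ABC):
    """Stand-in base class: aggregates per-sample outputs."""

    @abstractmethod
    def compute_corpus(self, items):  # assumed signature
        ...


class ExactMatch(SampleLevelComputation):
    """Hypothetical parametrizable sample-level metric."""

    def __init__(self, strip=True):
        self.strip = strip

    def compute(self, prediction, gold):
        if self.strip:
            prediction, gold = prediction.strip(), gold.strip()
        return float(prediction == gold)


class ToyMetric:
    """Simplified metric container that enforces the contract at creation."""

    def __init__(self, metric_name, sample_level_fn, corpus_level_fn):
        # Reject bound methods such as ExactMatch().compute: an object is required.
        if not isinstance(sample_level_fn, (SampleLevelComputation, Preparator)):
            raise TypeError("sample_level_fn must be a SampleLevelComputation or Preparator")
        # Accept either a CorpusLevelComputation or any plain callable (e.g. a mean).
        if not (isinstance(corpus_level_fn, CorpusLevelComputation) or callable(corpus_level_fn)):
            raise TypeError("corpus_level_fn must be callable or a CorpusLevelComputation")
        self.metric_name = metric_name
        self.sample_level_fn = sample_level_fn
        self.corpus_level_fn = corpus_level_fn

    def aggregate(self, items):
        if isinstance(self.corpus_level_fn, CorpusLevelComputation):
            return self.corpus_level_fn.compute_corpus(items)
        return self.corpus_level_fn(items)


metric = ToyMetric("em", ExactMatch(), mean)
scores = [metric.sample_level_fn.compute(p, g) for p, g in [(" a", "a"), ("b", "c")]]
```

Passing the object instead of its bound method is what keeps the parameters (here strip) reachable and adjustable after the metric has been created.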

All metrics with a parametrizable sample_level_fn can now be parametrized from the CLI call, for example: "lighteval|math_500@k=1|0|0" (users can also pass normalization function names, provided they are correctly defined in the normalization file). Corpus-level parametrization is not supported, but probably could be if we choose another symbol. This parametrization of Metrics.MyMetric relies on a trick: making the enum callable.
The metric list has been simplified to remove duplicate metrics that differed only by some parameters.
All tasks have therefore been changed to use the new metric names.
The test suite has been updated.
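
For readers unfamiliar with the callable-enum trick, here is a rough sketch of the idea together with a toy parser for the task string. It is an illustration under assumptions, not the implementation in this PR: PassAtK, ToyMetric, the PASS_AT_K member and parse_task_spec are hypothetical names, the handling of several "@key=value" groups is a guess, and the trailing "|0|0" fields are passed through uninterpreted.

```python
import copy
from enum import Enum


class PassAtK:
    """Hypothetical sample-level computation with one tunable parameter."""

    def __init__(self, k=1):
        self.k = k

    def compute(self, correct_flags):
        # 1.0 if any of the first k samples is correct, else 0.0.
        return float(any(correct_flags[: self.k]))


class ToyMetric:
    """Simplified metric container (hypothetical)."""

    def __init__(self, metric_name, sample_level_fn):
        self.metric_name = metric_name
        self.sample_level_fn = sample_level_fn


class Metrics(Enum):
    PASS_AT_K = ToyMetric("pass_at_k", PassAtK(k=1))

    def __call__(self, **params):
        # Calling a member returns a modified copy; the default stays untouched.
        metric = copy.deepcopy(self.value)
        for key, value in params.items():
            if not hasattr(metric.sample_level_fn, key):
                raise ValueError(f"{self.name} has no parameter {key!r}")
            setattr(metric.sample_level_fn, key, value)
        return metric


def parse_task_spec(spec):
    """Split 'suite|task@key=value|...' into its parts (assumed layout)."""
    suite, task, *rest = spec.split("|")
    name, *raw_params = task.split("@")
    params = {}
    for raw in raw_params:
        key, _, value = raw.partition("=")
        # Digits become ints; anything else (e.g. a function name) stays a string.
        params[key] = int(value) if value.isdigit() else value
    return suite, name, params, rest


suite, name, params, rest = parse_task_spec("lighteval|math_500@k=1|0|0")
custom = Metrics.PASS_AT_K(k=2)
```

Returning a deep copy from __call__ means Metrics.PASS_AT_K keeps its default k while each task gets its own parametrized instance.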

HuggingFaceDocBuilderDev · 2025-08-18T15:37:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs/source/metric-list.mdx

examples/custom_tasks_tests.py

src/lighteval/metrics/dynamic_metrics.py

src/lighteval/metrics/metrics.py

src/lighteval/metrics/utils/metric_utils.py

src/lighteval/tasks/default_tasks.py

src/lighteval/tasks/registry.py

tests/tasks/test_lighteval_task.py

NathanHB

First round of review: a few changes to make, but overall so much better. Great PR!

community_tasks/aimo_evals.py

src/lighteval/logging/evaluation_tracker.py

src/lighteval/metrics/utils/metric_utils.py

tests/reference_scores/SmolLM2-1.7B-Instruct-results-accelerate.json

src/lighteval/tasks/registry.py

src/lighteval/metrics/metrics_sample.py

src/lighteval/metrics/utils/metric_utils.py

src/lighteval/tasks/registry.py

Co-authored-by: Nathan Habib <[email protected]>

src/lighteval/metrics/utils/metric_utils.py

defined a sampling type for metrics, works for cli, now needs to port…

ead5bdb

… all old evals to the new format and figure out how to provide sane defaults

clefourrier marked this pull request as draft August 18, 2025 15:35

clefourrier and others added 16 commits August 18, 2025 15:44

rm useless case

8eece33

updated tests

8c5e5fb

fix test

ed0a02b

added conversion for normalizations

7394706

first pass transforming Hynek's metric functions into classes like th…

732c488

…ey should have been

imports

c0654c7

removed single token evals since we no longer have the feature, added…

a6e271a

… a name update with parameter change

keep on making metrics more adjustable

511d0e6

updating test suite given the new names

bc4bb7e

manual update of file

5d85a6e

manual update of file

917fb79

fix mcc single token

a367d73

now metrics are em

404d00f

some fixs for tests

f905750

rm trivia qa outdated

82b3fe9

removed dumdum enum overwrite

5c4f9ab