Fix ContextRelevancy score by Qard · Pull Request #159 · braintrustdata/autoevals

Qard · 2026-01-12T22:16:25Z

Fixes #80

github-actions · 2026-01-12T22:16:35Z

Braintrust eval report

Autoevals (fix-context-relevancy-scoring-1768258918)

Score	Average	Improvements	Regressions
NumericDiff	73.8% (+2pp)	7 🟢	1 🔴
Time_to_first_token	1.48tok (+0.1tok)	16 🟢	102 🔴
Llm_calls	1.55 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	279.25tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Completion_tokens	19.3tok (+0tok)	-	-
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	298.54tok (+0tok)	-	-
Estimated_cost	0$ (+0$)	-	-
Duration	1.5s (-1.27s)	109 🟢	110 🔴
Llm_duration	3.04s (+0.17s)	14 🟢	105 🔴

github-actions · 2026-01-12T22:16:35Z

Braintrust eval report

Autoevals (fix-context-relevancy-scoring-1768256188)

Score	Average	Improvements	Regressions
NumericDiff	73.8% (+0pp)	2 🟢	-
Time_to_first_token	1.49tok (+0.1tok)	15 🟢	103 🔴
Llm_calls	1.55 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	279.25tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Completion_tokens	19.3tok (+0tok)	-	-
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	298.54tok (+0tok)	-	-
Estimated_cost	0$ (0$)	1 🟢	-
Duration	1.5s (+0.04s)	60 🟢	159 🔴
Llm_duration	3.07s (+0.17s)	10 🟢	109 🔴

github-actions · 2026-01-13T18:20:18Z

Braintrust eval report

Autoevals (main-1768328420)

Score	Average	Improvements	Regressions
NumericDiff	72.9% (0pp)	3 🟢	2 🔴
Time_to_first_token	1.32tok (-0.08tok)	97 🟢	22 🔴
Llm_calls	1.55 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	279.25tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Completion_tokens	19.3tok (+0tok)	-	-
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	298.54tok (+0tok)	-	-
Estimated_cost	0$ (+0$)	-	-
Duration	3.22s (+0.2s)	142 🟢	77 🔴
Llm_duration	2.72s (-0.19s)	106 🟢	13 🔴

Qard requested review from ankrgyl and ibolmo January 12, 2026 22:16

Qard self-assigned this Jan 12, 2026

Qard added the bug label Jan 12, 2026

Qard force-pushed the fix-context-relevancy-scoring branch from 18fe998 to 1c6bf5b on January 12, 2026 22:18

Fix ContextRelevancy score

ae31527

Fixes #80

Qard force-pushed the fix-context-relevancy-scoring branch from 1c6bf5b to ae31527 on January 12, 2026 23:01

Qard added the lang:python and lang:typescript labels Jan 12, 2026

clutchski approved these changes Jan 13, 2026

Qard merged commit 0e5793b into main Jan 13, 2026
7 checks passed

Qard deleted the fix-context-relevancy-scoring branch January 13, 2026 18:20

Qard merged 1 commit into main from fix-context-relevancy-scoring

2 participants
