fix(bench): stop silent-failure patterns from biasing LoCoMo scores #35
base: master
```diff
@@ -300,6 +300,14 @@ async def _process_qa_mem0(
     category = int(qa.get("category", 0))
     evidence = qa.get("evidence", []) or []

+    # mem0 stored facts don't carry original LoCoMo dia_id metadata, so we
+    # cannot match against the gold evidence list. Report both hits and total
+    # as None — the metric is _unavailable_, not zero. _summary in the taosmd
+    # runner (and any downstream scorecard builder) skips None-valued rows
+    # from retrieval_recall so this artefact doesn't publish a fake 0.0.
+    # TODO: wire dia_id pass-through if mem0 adds metadata preservation.
+    EVIDENCE_UNAVAILABLE = {"evidence_hits": None, "evidence_total": None}
+
     t0 = time.time()
     try:
         # mem0.search is synchronous; offload to thread pool to avoid blocking.
```
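The `EVIDENCE_UNAVAILABLE` sentinel only works if downstream aggregation treats `None` as "metric unavailable" rather than as a zero. A minimal sketch of that skipping behaviour (the helper name `summarize_retrieval_recall` and the exact row shape are assumptions for illustration, not code from this PR):

```python
from typing import Optional


def summarize_retrieval_recall(rows: list[dict]) -> Optional[float]:
    """Aggregate evidence recall, skipping rows where the metric is unavailable.

    Rows carrying evidence_total=None (e.g. mem0 results, which lack dia_id
    metadata) are excluded from both numerator and denominator instead of
    being counted as 0.0 recall.
    """
    usable = [
        r for r in rows
        if r.get("evidence_total") is not None and r["evidence_total"] > 0
    ]
    if not usable:
        return None  # metric unavailable for this run, not 0.0
    return sum(r["evidence_hits"] / r["evidence_total"] for r in usable) / len(usable)


rows = [
    {"evidence_hits": 2, "evidence_total": 4},        # recall 0.5
    {"evidence_hits": None, "evidence_total": None},  # mem0 row: skipped
]
print(summarize_retrieval_recall(rows))  # → 0.5
```

The design point is that an all-`None` run reports the recall column itself as unavailable instead of publishing a misleading 0.0.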
```diff
@@ -313,67 +321,56 @@ async def _process_qa_mem0(
             "reference": reference,
             "predicted": "",
             "category": category,
-            "f1": 0.0,
-            "bleu1": 0.0,
-            "judge": 0.0,
+            "f1": None,
+            "bleu1": None,
+            "judge": None,
             "retrieval_ms": 0.0,
             "gen_ms": 0.0,
-            # evidence_hits/evidence_total not computable without dia_id metadata from
-            # mem0's stored facts (mem0 doesn't preserve per-turn dia_ids). Set to 0.
-            # TODO: if mem0 exposes metadata round-trip, wire dia_id here.
-            "evidence_hits": 0,
-            "evidence_total": len(evidence),
-            "error": str(exc),
+            **EVIDENCE_UNAVAILABLE,
+            "error": f"{type(exc).__name__}: {exc}",
         }
     retrieval_ms = (time.time() - t0) * 1000.0

     context = "\n---\n".join(context_chunks) if context_chunks else ""

     t1 = time.time()
+    generation_error: str | None = None
     try:
         predicted = await _ollama_generate(
             client, ollama_url, model,
             ANSWER_PROMPT.format(context=context, question=question),
         )
     except Exception as exc:
         predicted = ""
-        gen_ms = (time.time() - t1) * 1000.0
-        return {
-            "conversation_id": conv_id,
-            "question": question,
-            "reference": reference,
-            "predicted": predicted,
-            "category": category,
-            "f1": 0.0,
-            "bleu1": 0.0,
-            "judge": 0.0,
-            "retrieval_ms": round(retrieval_ms, 2),
-            "gen_ms": round(gen_ms, 2),
-            "evidence_hits": 0,
-            "evidence_total": len(evidence),
-            "error": str(exc),
-        }
+        generation_error = f"{type(exc).__name__}: {exc}"
     gen_ms = (time.time() - t1) * 1000.0

-    judge = await _judge(client, ollama_url, model, question, reference, predicted)
+    if generation_error is None:
+        judge = await _judge(client, ollama_url, model, question, reference, predicted)
+        f1_val = round(_f1(predicted, reference), 4)
+        bleu_val = round(_bleu1(predicted, reference), 4)
+        judge_val = round(judge, 4) if judge is not None else None
+    else:
+        f1_val = None
+        bleu_val = None
+        judge_val = None

-    return {
+    row: dict = {
         "conversation_id": conv_id,
         "question": question,
         "reference": reference,
         "predicted": predicted,
         "category": category,
-        "f1": round(_f1(predicted, reference), 4),
-        "bleu1": round(_bleu1(predicted, reference), 4),
-        "judge": round(judge, 4),
+        "f1": f1_val,
+        "bleu1": bleu_val,
+        "judge": judge_val,
         "retrieval_ms": round(retrieval_ms, 2),
         "gen_ms": round(gen_ms, 2),
-        # mem0 stored facts don't carry original dia_id metadata, so we cannot
-        # match against the LoCoMo gold evidence list. Set evidence_hits to 0.
-        # TODO: wire dia_id if mem0 adds metadata pass-through in a future release.
-        "evidence_hits": 0,
-        "evidence_total": len(evidence),
+        **EVIDENCE_UNAVAILABLE,
     }
+    if generation_error is not None:
+        row["error"] = generation_error
+    return row


 # ---------------------------------------------------------------------------
```

> **Review comment on lines +348 to +352:** Judge transport failures are still treated as wrong answers here. This branch expects …
> **Review comment:** These `None` sentinels currently break `_summary()` in this file. Every returned row now carries `evidence_total=None`, but the local `_summary()` still does `r.get("evidence_total", 0) > 0`, so aggregation will raise `TypeError` on the first mem0 result. The new `f1`/`bleu1`/`judge` `None` paths also conflict with its raw `sum(...) / n` averages. Please port the `benchmarks/locomo_runner.py::_summary` filtering/counting logic here before merging.
>
> Also applies to: 318-330, 348-369