Feat/wikipedia factual evaluator#84

Open
chaitanyamedidar wants to merge 7 commits into AOSSIE-Org:main from chaitanyamedidar:feat/wikipedia-factual-evaluator

Conversation


@chaitanyamedidar chaitanyamedidar commented Mar 21, 2026

Addressed Issues:

Implement Wikipedia-Based Factual Consistency Evaluator

Implements the factual-accuracy evaluation metric listed as a success criterion in the project specification.

Screenshots/Recordings:

Additional Notes:

The project specification requires evaluating trained models on factual accuracy. Unlike existing factual benchmarks, this evaluator uses Wikipedia itself as the source of truth, directly consistent with the project's core requirement that Wikipedia is the sole trusted data source.

How it works:

  • Reads processed wiki_clean.txt produced by the existing pipeline
  • Groups lines into passages, extracts named entities using a capitalized word-sequence regex
  • For each sentence containing a named entity, generates a counterfactual variant by substituting that entity with a different entity from the same passage (e.g. "Einstein was born in Germany" --> "Einstein was born in Poland")
  • Compares model perplexity on original vs counterfactual sentence pairs
  • A well-trained model should assign lower perplexity to the factual sentences than to their counterfactual variants
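The entity-substitution step can be sketched roughly as follows. This is an illustrative simplification, not the PR's exact code: the regex, the choice of which entity to swap, and the passage handling in `factual_consistency.py` may differ.

```python
import random
import re

# Capitalized word sequences serve as named-entity candidates
# (a simplification of the evaluator's approach).
_ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

def make_counterfactual(sentence, passage_entities, rng):
    """Swap the first entity found in `sentence` for a different entity
    from the same passage; return None if no swap is possible."""
    match = _ENTITY_RE.search(sentence)
    if match is None:
        return None
    found = match.group(0)
    alternatives = [e for e in passage_entities if e != found]
    if not alternatives:
        return None
    substitute = rng.choice(alternatives)
    # Word-boundary anchoring avoids accidental substring matches.
    return re.sub(r"\b" + re.escape(found) + r"\b", substitute, sentence, count=1)

rng = random.Random(42)
print(make_counterfactual("Einstein was born in Germany.",
                          ["Einstein", "Germany", "Poland"], rng))
```

Here the first matched entity is swapped; note the word-boundary substitution, which the review later recommends over a plain `str.replace`.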

Why this approach over existing benchmarks:

  • TruthfulQA and similar benchmarks use privately curated Q&A pairs --> inconsistent with the project's open data philosophy
  • This evaluator derives ground truth entirely from Wikipedia, making it fully reproducible and verifiable by anyone with the same dump
  • No external dataset dependency, pure stdlib only

Determinism:
`random.seed(42)` is applied inside `evaluate()` before any entity selection, ensuring fully reproducible evaluation results, consistent with the project's core reproducibility requirement.
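A minimal illustration of this guarantee (a sketch only: the PR seeds the global RNG, whereas this example uses a local `random.Random` instance, which yields the same reproducibility without touching global state):

```python
import random

def pick_entities(entities, k, seed=42):
    """Deterministically sample k entities: the same seed always
    yields the same selection, so evaluation runs are reproducible."""
    rng = random.Random(seed)  # local RNG, independent of global state
    return rng.sample(entities, k)

entities = ["Einstein", "Germany", "Poland", "Berlin"]
run1 = pick_entities(entities, 2)
run2 = pick_entities(entities, 2)
assert run1 == run2  # identical across runs
```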

Files changed:

  • openverifiablellm/eval/factual/factual_consistency.py: WikipediaFactualEvaluator
  • openverifiablellm/eval/factual/__init__.py: exports WikipediaFactualEvaluator
  • openverifiablellm/eval/base.py: abstract BaseEvaluator
  • openverifiablellm/eval/perplexity.py: PerplexityEvaluator with shared compute_sentence_perplexity static method
  • openverifiablellm/eval/__init__.py: exports all evaluators
  • tests/test_factual_eval.py: 10 tests, all passing, no network calls, includes determinism test
  • pyproject.toml: added datasets as runtime dependency

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added a Perplexity evaluator to measure language-model perplexity on benchmarks.
    • Added a Wikipedia-based Factual Consistency evaluator that generates counterfactuals and reports comparative scores.
  • Tests

    • Added tests covering passage extraction, entity substitution, evaluator outputs, determinism, and sampling limits.
  • Chores

    • Updated project dependencies to include dataset handling and aligned dev extras.


coderabbitai bot commented Mar 21, 2026

Walkthrough

Adds an evaluation framework: a BaseEvaluator protocol and two implementations — PerplexityEvaluator (dataset streaming, perplexity calculations) and WikipediaFactualEvaluator (counterfactual generation from wiki text and perplexity-based scoring). Exports updated, dependency added, and tests included.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Core API & Exports**<br>`openverifiablellm/eval/base.py`, `openverifiablellm/eval/__init__.py` | Introduce Model/Tokenizer protocols and abstract BaseEvaluator; expose PerplexityEvaluator at package level via `__all__`. |
| **Perplexity Evaluator**<br>`openverifiablellm/eval/perplexity.py` | New PerplexityEvaluator implementing sequence/sentence perplexity computation, stride/window scoring, a deterministic uniform_model for tests, and dataset streaming-based evaluate. |
| **Factual Consistency Evaluator**<br>`openverifiablellm/eval/factual/factual_consistency.py`, `openverifiablellm/eval/factual/__init__.py` | New WikipediaFactualEvaluator that extracts passages from wiki text, finds capitalized named-entity candidates, generates counterfactuals by entity substitution, and computes factual vs. counterfactual perplexities and a delta score. |
| **Project Config**<br>`pyproject.toml` | Add `datasets` to dependencies and add `[project.optional-dependencies].dev` with pytest and ruff. |
| **Tests**<br>`tests/test_factual_eval.py` | Add pytest suite covering passage extraction, entity substitution, evaluation outputs, numeric assertions, sample limits, and determinism via fixed random seed. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant PerplexityEvaluator
    participant Dataset as "HuggingFace Dataset"
    participant Tokenizer
    participant Model

    Client->>PerplexityEvaluator: evaluate(model, tokenizer)
    PerplexityEvaluator->>Dataset: load(benchmark, split, streaming=True)
    Dataset-->>PerplexityEvaluator: stream rows
    loop up to n_samples
        PerplexityEvaluator->>Tokenizer: encode(text)
        Tokenizer-->>PerplexityEvaluator: token_ids
        PerplexityEvaluator->>Model: forward(token_ids windowed)
        Model-->>PerplexityEvaluator: logits
        PerplexityEvaluator->>PerplexityEvaluator: compute log-probs, aggregate NLL
    end
    PerplexityEvaluator-->>Client: {"perplexity": mean_ppl}
```
```mermaid
sequenceDiagram
    participant Client
    participant WikiEvaluator as "WikipediaFactualEvaluator"
    participant FS as "Wiki Text File"
    participant Tokenizer
    participant Perplexity as "PerplexityEvaluator"
    participant Model

    Client->>WikiEvaluator: evaluate(model, tokenizer)
    WikiEvaluator->>FS: read(wiki_clean.txt)
    FS-->>WikiEvaluator: raw text
    WikiEvaluator->>WikiEvaluator: _extract_passages() -> [{original, counterfactual}, ...]
    loop for each pair
        WikiEvaluator->>Tokenizer: encode(original)
        Tokenizer-->>WikiEvaluator: ids_orig
        WikiEvaluator->>Tokenizer: encode(counterfactual)
        Tokenizer-->>WikiEvaluator: ids_cf
        WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_orig)
        Perplexity->>Model: forward(ids_orig)
        Model-->>Perplexity: logits
        Perplexity-->>WikiEvaluator: factual_ppl
        WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_cf)
        Perplexity->>Model: forward(ids_cf)
        Model-->>Perplexity: logits
        Perplexity-->>WikiEvaluator: counterfactual_ppl
    end
    WikiEvaluator-->>Client: {"factual_perplexity": fp, "counterfactual_perplexity": cp, "factual_score": cp-fp}
```
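The per-pair scoring in the diagram ultimately reduces to a simple aggregation. A sketch of that final step, including filtering of non-finite perplexities (names here are illustrative, not the PR's exact implementation):

```python
import math

def aggregate_scores(pairs_ppl):
    """Aggregate (factual_ppl, counterfactual_ppl) pairs into the
    metrics evaluate() reports; non-finite pairs are skipped."""
    factual, counterfactual = [], []
    for f_ppl, cf_ppl in pairs_ppl:
        if math.isfinite(f_ppl) and math.isfinite(cf_ppl):
            factual.append(f_ppl)
            counterfactual.append(cf_ppl)
    if not factual:  # nothing scorable: return NaNs rather than divide by zero
        nan = float("nan")
        return {"factual_perplexity": nan,
                "counterfactual_perplexity": nan,
                "factual_score": nan}
    fp = sum(factual) / len(factual)
    cp = sum(counterfactual) / len(counterfactual)
    # A positive score means the model prefers the factual sentences.
    return {"factual_perplexity": fp,
            "counterfactual_perplexity": cp,
            "factual_score": cp - fp}

print(aggregate_scores([(10.0, 14.0), (8.0, 10.0), (float("inf"), 9.0)]))
```

With the inputs above, the infinite pair is dropped, giving a factual mean of 9.0, a counterfactual mean of 12.0, and a score of 3.0.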

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibble tokens, swap names with a hop,

facts turned playful, then measured non-stop.
Base roots planted, perplexities bloom,
wiki bits danced in a counterfactual room,
tests keep my hops steady — ready to pop! 🎉

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Feat/wikipedia factual evaluator' clearly and concisely summarizes the main change: introducing a Wikipedia-based factual consistency evaluator as a new feature. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/eval/base.py`:
- Around line 7-24: Add typing.Protocol-based type hints for the evaluate method
to improve IDE/static analysis: define Protocols (e.g., Model and Tokenizer)
describing the callable model (__call__(input_ids: list[int]) ->
list[list[float]]) and the tokenizer (encode(text: str) -> list[int]) and import
required types from typing/typing_extensions; then update the abstractmethod
signature of evaluate to accept model: Model and tokenizer: Tokenizer and return
dict. Ensure the Protocol definitions and imports are placed near the top of
openverifiablellm/eval/base.py and referenced in the evaluate method signature.

In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 85-95: The replacement can accidentally match substrings; instead
of str.replace(found_entity, substitute, 1) use a regex substitution anchored to
word boundaries so only the whole found_entity token is replaced: build a
pattern using re.escape(found_entity) wrapped with r'\b...\b' and call
re.sub(pattern, substitute, sentence, count=1). Update the block that uses
_ENTITY_RE, matches, found_entity, substitute, and candidate_entities to perform
this regex-based replacement (ensure re is imported where this module uses it).
- Around line 217-235: The score calculation can produce inf/nan when
compute_sentence_perplexity returns non-finite values; in the loop over pairs
(referencing pairs, tokenizer.encode,
PerplexityEvaluator.compute_sentence_perplexity and the lists factual_ppls,
counterfactual_ppls, score_diffs) check math.isfinite(factual_ppl) and
math.isfinite(cf_ppl) before appending or computing cf_ppl - factual_ppl and
skip that pair if either is non-finite; after the loop handle the case n==0 by
returning sane defaults (e.g., float("nan") for each metric) to avoid division
by zero and propagation of nan.

In `@openverifiablellm/eval/perplexity.py`:
- Around line 201-215: The code currently hardcodes load_dataset(self.benchmark,
split="test", streaming=True) which can fail for benchmarks without a "test"
split; update the logic in perplexity.py to accept a configurable split (e.g.,
self.split) and/or implement a safe fallback sequence (try "test", then
"validation", then "train") when calling hf_datasets.load_dataset, and raise a
clear error if none exist; modify the call site where ds is created and ensure
downstream logic using ds (the loop that uses tokenizer.encode and
self.n_samples and compute_sequence_perplexity) remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4348df36-40bf-47e6-a433-89ccbef3f87e

📥 Commits

Reviewing files that changed from the base of the PR and between 578bc79 and e72d58b.

📒 Files selected for processing (7)
  • openverifiablellm/eval/__init__.py
  • openverifiablellm/eval/base.py
  • openverifiablellm/eval/factual/__init__.py
  • openverifiablellm/eval/factual/factual_consistency.py
  • openverifiablellm/eval/perplexity.py
  • pyproject.toml
  • tests/test_factual_eval.py

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 21, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 205-206: The evaluate() function currently calls random.seed(42)
which mutates global RNG; instead create a local RNG like rng =
random.Random(42), remove the global seed call, and pass this rng into
_extract_passages and any downstream substitution logic so all random operations
use rng (e.g., replace random.sample/random.choice calls inside
_extract_passages, substitution functions, or helper methods with
rng.sample/rng.choice). Update signatures for _extract_passages and any helper
functions to accept an rng parameter (defaulting to None or Random() if needed)
and thread it through to ensure determinism without touching global state.

In `@openverifiablellm/eval/perplexity.py`:
- Around line 125-161: The function compute_sequence_perplexity uses stride as
the loop step but does not validate it; add an early check at the top of
compute_sequence_perplexity to ensure stride is an integer >= 1 (e.g., if not
isinstance(stride, int) or stride < 1: raise ValueError("stride must be an
integer >= 1")), so zero or negative values (and non-integers) are rejected
before the for start in range(...) is executed.
- Around line 221-224: The loop currently hardcodes row.get("text") so add a
configurable parameter text_field (default "text") to the surrounding function
(the perplexity evaluator function that contains the loop over ds) and use
row.get(text_field, "") instead; validate after iterating a sample or the
dataset that at least one non-empty text was found and raise a clear
configuration error (or log and return) if none are present; propagate the new
text_field parameter through any public wrapper/CLI callers so benchmarks using
"content", "body", "article", etc. work correctly and update any related tests
to pass the alternative field name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0e4a5eda-8185-4706-ab84-9727eeedaeeb

📥 Commits

Reviewing files that changed from the base of the PR and between e72d58b and 5f770d0.

📒 Files selected for processing (3)
  • openverifiablellm/eval/base.py
  • openverifiablellm/eval/factual/factual_consistency.py
  • openverifiablellm/eval/perplexity.py

Comment on lines +205 to +206

```python
        random.seed(42)
        pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
```

⚠️ Potential issue | 🟠 Major

Avoid mutating global RNG state in evaluate().

random.seed(42) resets process-wide randomness and can affect unrelated code paths. Use a local random.Random(42) instance and thread it through substitution logic.

Proposed fix

```diff
-    @staticmethod
-    def _substitute_entity(sentence: str, candidate_entities: List[str]) -> Optional[str]:
+    @staticmethod
+    def _substitute_entity(
+        sentence: str, candidate_entities: List[str], rng: random.Random
+    ) -> Optional[str]:
@@
-        substitute = random.choice(alternatives)
+        substitute = rng.choice(alternatives)
@@
-    def _extract_passages(
+    def _extract_passages(
         wiki_text_path: Union[str, Path],
         n_samples: Optional[int],
+        rng: random.Random,
     ) -> List[dict]:
@@
-                counterfactual = WikipediaFactualEvaluator._substitute_entity(
-                    sentence, all_entities
-                )
+                counterfactual = WikipediaFactualEvaluator._substitute_entity(
+                    sentence, all_entities, rng
+                )
@@
-        random.seed(42)
-        pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
+        rng = random.Random(42)
+        pairs = self._extract_passages(self.wiki_text_path, self.n_samples, rng)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/eval/factual/factual_consistency.py` around lines 205 -
206, The evaluate() function currently calls random.seed(42) which mutates
global RNG; instead create a local RNG like rng = random.Random(42), remove the
global seed call, and pass this rng into _extract_passages and any downstream
substitution logic so all random operations use rng (e.g., replace
random.sample/random.choice calls inside _extract_passages, substitution
functions, or helper methods with rng.sample/rng.choice). Update signatures for
_extract_passages and any helper functions to accept an rng parameter
(defaulting to None or Random() if needed) and thread it through to ensure
determinism without touching global state.

Comment on lines +125 to +161

```python
def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
    """
    Compute perplexity over a (possibly long) sequence using non-overlapping
    stride-sized windows.

    The sequence is partitioned into windows of *stride* tokens. Each
    window contributes its token predictions to a pooled NLL. The final
    perplexity is ``exp(total_NLL / total_scored_tokens)``.

    For sequences shorter than *stride* + 1 tokens the result is
    identical to :meth:`compute_sentence_perplexity`.

    Parameters
    ----------
    model : callable
        ``model(input_ids) -> 2-D sequence`` of shape
        ``(len(input_ids), vocab_size)``.
    token_ids : list[int]
        Tokenised sequence.
    stride : int
        Number of tokens scored per window. Default ``512``.

    Returns
    -------
    float
        Perplexity (≥ 1). Returns ``float("inf")`` for sequences
        shorter than 2 tokens.
    """
    if len(token_ids) < 2:
        return float("inf")

    nll_sum = 0.0
    n_scored = 0
    n = len(token_ids)

    for start in range(0, n - 1, stride):
        end = min(start + stride + 1, n)
```

⚠️ Potential issue | 🟠 Major

Validate stride before using it as a window step.

stride is user-configurable; stride=0 raises at runtime and negative values yield invalid scoring behavior. Guard it early.

Proposed fix

```diff
 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
 ):
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
@@
 @staticmethod
 def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     if len(token_ids) < 2:
         return float("inf")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/eval/perplexity.py` around lines 125 - 161, The function
compute_sequence_perplexity uses stride as the loop step but does not validate
it; add an early check at the top of compute_sequence_perplexity to ensure
stride is an integer >= 1 (e.g., if not isinstance(stride, int) or stride < 1:
raise ValueError("stride must be an integer >= 1")), so zero or negative values
(and non-integers) are rejected before the for start in range(...) is executed.
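To see why the pooled formula behaves sensibly once stride is validated, here is a self-contained sketch of stride-windowed perplexity checked against a uniform model, where perplexity must equal the vocabulary size for any stride. The probability-returning `model(ids)` interface is an assumption of this sketch, not necessarily the PR's exact contract:

```python
import math

def windowed_perplexity(model, token_ids, stride=512):
    """Pooled-NLL perplexity over stride-sized windows.
    `model(ids)` is assumed to return one probability distribution
    over the vocabulary per input position (sketch assumption)."""
    if not isinstance(stride, int) or stride < 1:
        raise ValueError("stride must be an integer >= 1")
    if len(token_ids) < 2:
        return float("inf")
    nll_sum, n_scored = 0.0, 0
    n = len(token_ids)
    for start in range(0, n - 1, stride):
        end = min(start + stride + 1, n)
        window = token_ids[start:end]
        probs = model(window)  # shape (len(window), vocab_size)
        # Position i predicts the token at position i + 1 in the window.
        for i in range(len(window) - 1):
            nll_sum -= math.log(probs[i][window[i + 1]])
            n_scored += 1
    return math.exp(nll_sum / n_scored)

VOCAB = 10
def uniform_model(ids):
    return [[1.0 / VOCAB] * VOCAB for _ in ids]

# Under a uniform model, perplexity equals the vocab size regardless of stride.
print(windowed_perplexity(uniform_model, [1, 2, 3, 4, 5, 6], stride=2))   # 10.0
print(windowed_perplexity(uniform_model, [1, 2, 3, 4, 5, 6], stride=512)) # 10.0
```

Because each window overlaps its neighbor by one token, every token after the first is scored exactly once, so the pooled NLL is independent of the stride for this model.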

Comment on lines +221 to +224

```python
        for row in ds:
            text = row.get("text", "")
            if not text.strip():
                continue
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify actual row keys for the configured benchmark/split(s).
# Update BENCHMARK as needed.
python - <<'PY'
import datasets

BENCHMARK = "wikitext"
splits = ["test", "validation", "train"]

for s in splits:
    try:
        ds = datasets.load_dataset(BENCHMARK, split=s, streaming=True)
        row = next(iter(ds))
        print(f"{s}: keys={list(row.keys())}")
    except Exception as e:
        print(f"{s}: unavailable ({type(e).__name__}: {e})")
PY
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 192


🏁 Script executed:

```shell
# First, locate the file and examine it
git ls-files | grep perplexity.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 108


🏁 Script executed:

```shell
# Read the perplexity.py file to see the implementation
wc -l openverifiablellm/eval/perplexity.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 112


🏁 Script executed:

```shell
# Read the entire file with line numbers
cat -n openverifiablellm/eval/perplexity.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 9490


🏁 Script executed:

```shell
# Search for test files related to perplexity
find . -name "*test*" -type f | grep -i perplex
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 54


🏁 Script executed:

```shell
# Search for usage of PerplexityEvaluator
rg "PerplexityEvaluator" -n
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1158


🏁 Script executed:

```shell
# Check the test file to see how PerplexityEvaluator is being used
cat -n tests/test_factual_eval.py | head -100
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3445


🏁 Script executed:

```shell
# Check factual_consistency.py for how evaluate() is used
cat -n openverifiablellm/eval/factual/factual_consistency.py | head -250
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 11115


🏁 Script executed:

```shell
# Search for calls to .evaluate() method that might use the "text" field
rg "\.evaluate\(" -n | grep -i perplex
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 54


🏁 Script executed:

```shell
# Search for dataset loading patterns and benchmark usage
rg "load_dataset" -n -A 2 -B 2
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 895


🏁 Script executed:

```shell
# Check if there's any documentation or README about supported benchmarks
find . -name "README*" -o -name "*.md" | head -20
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 136


🏁 Script executed:

```shell
# Check README for information about supported benchmarks
cat -n README.md | head -150
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 5896


🏁 Script executed:

```shell
# Search for any test cases or examples that use different benchmarks
rg "benchmark" -n openverifiablellm/eval/ | head -30
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 884


🏁 Script executed:

```shell
# Check if there are any other references to field names or text extraction
rg "get\(" openverifiablellm/eval/perplexity.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 110


🏁 Script executed:

```shell
# Let's search for any comments or documentation about dataset field names
rg "field" -n openverifiablellm/eval/
```

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 54


Add configurable text_field parameter to support benchmarks with different field names.

Line 222 hardcodes the "text" field name when extracting text from dataset rows. For Hugging Face datasets that use different field names (e.g., content, body, article), all rows are silently skipped and the evaluator returns inf instead of surfacing the misconfiguration.

Proposed fix

```diff
 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
+    text_field: str = "text",
 ):
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
+    self.text_field = text_field
@@
-    text = row.get("text", "")
+    text = row.get(self.text_field, "")
     if not text.strip():
         continue
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-for row in ds:
-    text = row.get("text", "")
-    if not text.strip():
-        continue
+def __init__(
+    self,
+    benchmark: str = "wikitext",
+    n_samples: Optional[int] = 50,
+    stride: int = 512,
+    split: Optional[str] = None,
+    text_field: str = "text",
+):
+    self.benchmark = benchmark
+    self.n_samples = n_samples
+    self.stride = stride
+    self.split = split
+    self.text_field = text_field
+
+# ... in the evaluate() method:
+for row in ds:
+    text = row.get(self.text_field, "")
+    if not text.strip():
+        continue
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/eval/perplexity.py` around lines 221 - 224, The loop
currently hardcodes row.get("text") so add a configurable parameter text_field
(default "text") to the surrounding function (the perplexity evaluator function
that contains the loop over ds) and use row.get(text_field, "") instead;
validate after iterating a sample or the dataset that at least one non-empty
text was found and raise a clear configuration error (or log and return) if none
are present; propagate the new text_field parameter through any public
wrapper/CLI callers so benchmarks using "content", "body", "article", etc. work
correctly and update any related tests to pass the alternative field name.
