diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..d17cb16 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,96 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +Autoevals is a dual-language library (TypeScript + Python) for evaluating AI model outputs. It provides LLM-as-a-judge evaluations, heuristic scorers (Levenshtein distance), and statistical metrics (BLEU). Developed by Braintrust. + +## Commands + +### TypeScript (in root directory) + +```bash +pnpm install # Install dependencies +pnpm run build # Build JS (outputs to jsdist/) +pnpm run test # Run all JS tests with vitest +pnpm run test -- js/llm.test.ts # Run single test file +pnpm run test -- -t "test name" # Run specific test by name +``` + +### Python (from root directory) + +```bash +make develop # Set up Python venv with all dependencies +source env.sh # Activate the venv +pytest # Run all Python tests +pytest py/autoevals/test_llm.py # Run single test file +pytest py/autoevals/test_llm.py::test_openai # Run specific test +pytest -k "test_name" # Run tests matching pattern +``` + +### Linting + +```bash +pre-commit run --all-files # Run all linters (black, ruff, prettier, codespell) +make fixup # Same as above +``` + +## Architecture + +### Dual Implementation Pattern + +The library maintains parallel implementations in TypeScript (`js/`) and Python (`py/autoevals/`). Both share: + +- The same evaluation templates (`templates/*.yaml`) +- The same `Score` interface: `{name, score (0-1), metadata}` +- The same scorer names and behavior + +### Key Modules (both languages) + +- `llm.ts` / `llm.py` - LLM-as-a-judge scorers (Factuality, Battle, ClosedQA, Humor, Security, Sql, Summary, Translation) +- `ragas.ts` / `ragas.py` - RAG evaluation metrics (ContextRelevancy, Faithfulness, AnswerRelevancy, etc.) +- `string.ts` / `string.py` - Text similarity (Levenshtein, EmbeddingSimilarity) +- `json.ts` / `json.py` - JSON validation and diff +- `oai.ts` / `oai.py` - OpenAI client wrapper with caching +- `score.ts` / `score.py` - Core Score type and Scorer base class + +### Template System + +YAML templates in `templates/` define LLM classifier prompts. Templates use Mustache syntax (`{{variable}}`). The `LLMClassifier` class loads these templates and handles: + +- Prompt rendering with chain-of-thought (CoT) suffix +- Tool-based response parsing via `select_choice` function +- Score mapping from choice letters to numeric scores + +### Python Scorer Pattern + +```python +class Scorer(ABC): + def eval(self, output, expected=None, **kwargs) -> Score # Sync + async def eval_async(self, output, expected=None, **kwargs) # Async + def __call__(...) 
# Alias for eval() +``` + +### TypeScript Scorer Pattern + +```typescript +type Scorer = ( + args: ScorerArgs, +) => Score | Promise; +// All scorers are async functions +``` + +## Environment Variables + +Tests require: + +- `OPENAI_API_KEY` or `BRAINTRUST_API_KEY` - For LLM-based evaluations +- `OPENAI_BASE_URL` (optional) - Custom API endpoint + +## Testing Notes + +- Python tests use `pytest` with `respx` for HTTP mocking +- TypeScript tests use `vitest` with `msw` for HTTP mocking +- Tests that call real LLM APIs need valid API keys +- Test files are colocated: `test_*.py` (Python), `*.test.ts` (TypeScript) diff --git a/README.md b/README.md index fb5e907..a5eda5d 100644 --- a/README.md +++ b/README.md @@ -361,6 +361,8 @@ Eval( - Numeric difference - JSON diff +For detailed documentation on all scorers, including parameters, score ranges, and usage examples, see the [**Scorer Reference**](SCORERS.md). + ## Custom evaluation prompts Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism: diff --git a/SCORERS.md b/SCORERS.md new file mode 100644 index 0000000..9515142 --- /dev/null +++ b/SCORERS.md @@ -0,0 +1,628 @@ +# Autoevals scorer reference + +Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples. + +## Table of contents + +- [LLM-as-a-judge scorers](#llm-as-a-judge-scorers) +- [RAG (Retrieval-Augmented Generation) scorers](#rag-retrieval-augmented-generation-scorers) +- [Heuristic scorers](#heuristic-scorers) +- [JSON scorers](#json-scorers) +- [List scorers](#list-scorers) + +--- + +## LLM-as-a-judge scorers + +These scorers use language models to evaluate outputs based on semantic understanding. + +### Factuality + +Evaluates whether the output is factually consistent with the expected answer. + +**Parameters:** + +- `input` (string): The input question or prompt +- `output` (string, required): The generated answer to evaluate +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Model to use (default: configured via `init()` or "gpt-4o") +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Output is factually accurate +- `0.0` = Output contains factual errors + +**Example:** + +```typescript +import { Factuality } from "autoevals"; + +const result = await Factuality({ + input: "What is the capital of France?", + output: "Paris", + expected: "The capital of France is Paris", +}); +// Score: 1.0 (factually correct) +``` + +### Battle + +Compares two outputs and determines which one is better. + +**Parameters:** + +- `input` (string): The input question or prompt +- `output` (string, required): First answer to compare +- `expected` (string, required): Second answer to compare +- `model` (string, optional): Model to use +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Output is significantly better than expected +- `0.5` = Both outputs are roughly equal +- `0.0` = Expected is significantly better than output + +**Example:** + +```python +from autoevals.llm import Battle + +evaluator = Battle() +result = evaluator.eval( + input="Explain photosynthesis", + output="Plants use sunlight to make food from CO2 and water", + expected="Photosynthesis is a process" +) +# Score: ~1.0 (first answer is more detailed) +``` + +### ClosedQA + +Evaluates answers to closed-ended questions where there's a clear correct answer. 
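+
+A minimal usage sketch (the `criteria` string below is illustrative):
+
+```python
+from autoevals.llm import ClosedQA
+
+evaluator = ClosedQA()
+result = evaluator.eval(
+    input="What is 2 + 2?",
+    output="4",
+    criteria="The answer must be arithmetically correct.",
+)
+print(result.score)  # 1.0 when the answer satisfies the criteria
+```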
+ +**Parameters:** + +- `input` (string): The question +- `output` (string, required): The generated answer +- `expected` (string, required): The correct answer +- `model` (string, optional): Model to use +- `criteria` (string, optional): Custom evaluation criteria + +**Score Range:** 0-1 + +- `1.0` = Answer is correct +- `0.0` = Answer is incorrect + +### Humor + +Evaluates whether the output is humorous. + +**Parameters:** + +- `input` (string): The context or setup +- `output` (string, required): The text to evaluate for humor +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Very humorous +- `0.0` = Not humorous + +### Security + +Evaluates whether the output contains security vulnerabilities or unsafe content. + +**Parameters:** + +- `output` (string, required): The content to evaluate +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = No security concerns +- `0.0` = Contains security vulnerabilities + +### Moderation + +Evaluates content for policy violations using OpenAI's moderation API. + +**Parameters:** + +- `output` (string, required): The content to moderate +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Content is safe +- `0.0` = Content violates policies + +**Categories Checked:** + +- Sexual content +- Hate speech +- Harassment +- Self-harm +- Violence +- Sexual content involving minors + +### Sql + +Evaluates SQL query correctness and quality. + +**Parameters:** + +- `input` (string): The natural language question +- `output` (string, required): The generated SQL query +- `expected` (string, optional): The correct SQL query +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +### Summary + +Evaluates the quality of text summaries. + +**Parameters:** + +- `input` (string): The original text +- `output` (string, required): The generated summary +- `expected` (string, optional): A reference summary +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Excellent summary (accurate, concise, complete) +- `0.0` = Poor summary + +### Translation + +Evaluates translation quality. + +**Parameters:** + +- `input` (string): The source text +- `output` (string, required): The generated translation +- `expected` (string, optional): A reference translation +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Excellent translation +- `0.0` = Poor translation + +--- + +## RAG (Retrieval-Augmented Generation) scorers + +These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality. + +All RAG scorers support passing `context` through the `metadata` parameter when used with Braintrust Eval. See the [RAGAS module documentation](js/ragas.ts) for examples. + +### ContextRelevancy + +Evaluates how relevant the retrieved context is to the input question. + +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use (default: "gpt-4o-mini") + +**Score Range:** 0-1 + +- `1.0` = All context is highly relevant +- `0.0` = Context is irrelevant + +**Example:** + +```python +from autoevals.ragas import ContextRelevancy + +scorer = ContextRelevancy() +result = scorer.eval( + input="What is the capital of France?", + output="Paris", + context=[ + "Paris is the capital of France.", + "Berlin is the capital of Germany." 
+ ] +) +# Score: ~0.5 (only first context item is relevant) +``` + +### ContextRecall + +Measures how well the context supports the expected answer. + +**Parameters:** + +- `input` (string): The question +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Context fully supports the expected answer +- `0.0` = Context doesn't support the expected answer + +### ContextPrecision + +Measures the precision of retrieved context - whether relevant context appears before irrelevant context. + +**Parameters:** + +- `input` (string, required): The question +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages (order matters) +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All relevant context appears first +- `0.0` = Relevant context is buried under irrelevant context + +### ContextEntityRecall + +Measures how well the context contains entities from the expected answer. + +**Parameters:** + +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All entities from expected answer are in context +- `0.0` = No entities from expected answer are in context + +### Faithfulness + +Evaluates whether the generated answer's claims are supported by the context. + +**Parameters:** + +- `input` (string): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All claims in the answer are supported by context +- `0.0` = Answer contains unsupported claims (hallucinations) + +**Example:** + +```typescript +import { Faithfulness } from "autoevals/ragas"; + +const result = await Faithfulness({ + input: "What is photosynthesis?", + output: + "Photosynthesis is how plants make food using sunlight and also they can fly.", + context: [ + "Photosynthesis is the process by which plants use sunlight to synthesize foods.", + ], +}); +// Score: ~0.5 (first claim supported, "can fly" is not) +``` + +### AnswerRelevancy + +Measures how relevant the answer is to the question. + +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, optional): Retrieved context passages +- `model` (string, optional): Model to use +- `embedding_model` (string, optional): Model for embeddings (default: "text-embedding-3-small") + +**Score Range:** 0-1 + +- `1.0` = Answer directly addresses the question +- `0.0` = Answer is off-topic + +### AnswerSimilarity + +Compares semantic similarity between the generated answer and expected answer using embeddings. + +**Parameters:** + +- `output` (string, required): The generated answer +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Embedding model to use (default: "text-embedding-3-small") + +**Score Range:** 0-1 + +- `1.0` = Answers are semantically identical +- `0.0` = Answers are completely different + +### AnswerCorrectness + +Combines factual correctness and semantic similarity to evaluate answers. 
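+
+A minimal usage sketch (the final score blends an LLM factuality check with embedding similarity, using the weights listed below):
+
+```python
+from autoevals.ragas import AnswerCorrectness
+
+scorer = AnswerCorrectness()
+result = scorer.eval(
+    input="What is the capital of France?",
+    output="Paris is the capital of France.",
+    expected="Paris",
+)
+print(result.score)  # Near 1.0 for a factually correct, semantically similar answer
+```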
+ +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Model for factuality checking +- `embedding_model` (string, optional): Model for similarity (default: "text-embedding-3-small") +- `factuality_weight` (number, optional): Weight for factuality (default: 0.75) +- `answer_similarity_weight` (number, optional): Weight for similarity (default: 0.25) + +**Score Range:** 0-1 + +**Formula:** `score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)` + +--- + +## Heuristic scorers + +Fast, deterministic scorers that don't use LLMs. + +### Levenshtein + +Calculates Levenshtein (edit) distance between strings, normalized to 0-1. + +**Parameters:** + +- `output` (string, required): The generated text +- `expected` (string, required): The expected text + +**Score Range:** 0-1 + +- `1.0` = Strings are identical +- `0.0` = Strings are completely different + +**Example:** + +```python +from autoevals.string import Levenshtein + +scorer = Levenshtein() +result = scorer.eval(output="hello", expected="helo") +# Score: ~0.8 (1 character difference) +``` + +### ExactMatch + +Binary scorer that checks for exact string equality. + +**Parameters:** + +- `output` (any, required): The generated value +- `expected` (any, required): The expected value + +**Score Range:** 0 or 1 + +- `1` = Values are exactly equal +- `0` = Values differ + +### NumericDiff + +Evaluates numeric differences with configurable thresholds. + +**Parameters:** + +- `output` (number, required): The generated number +- `expected` (number, required): The expected number +- `max_diff` (number, optional): Maximum acceptable difference (default: 0) +- `relative` (boolean, optional): Use relative difference (default: false) + +**Score Range:** 0-1 + +- `1.0` = Numbers are equal (within threshold) +- `0.0` = Numbers differ significantly + +**Formula (absolute):** `score = max(0, 1 - |output - expected| / max_diff)` (when max_diff > 0) + +**Formula (relative):** `score = max(0, 1 - |output - expected| / |expected|)` + +**Example:** + +```typescript +import { NumericDiff } from "autoevals"; + +// Absolute difference +const result1 = await NumericDiff({ + output: 10.5, + expected: 10.0, + maxDiff: 1.0, +}); +// Score: 0.5 (difference of 0.5 out of max 1.0) + +// Relative difference +const result2 = await NumericDiff({ + output: 100, + expected: 110, + relative: true, +}); +// Score: ~0.91 (10% difference) +``` + +### EmbeddingSimilarity + +Compares semantic similarity using text embeddings (cosine similarity). + +**Parameters:** + +- `output` (string, required): The generated text +- `expected` (string, required): The expected text +- `model` (string, optional): Embedding model (default: "text-embedding-3-small") +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** -1 to 1 (typically 0-1 for text) + +- `1.0` = Semantically identical +- `0.0` = Unrelated +- `-1.0` = Opposite meanings (rare) + +--- + +## JSON scorers + +Scorers for evaluating JSON outputs. + +### JSONDiff + +Recursively compares JSON objects with customizable string and number comparison. 
+ +**Parameters:** + +- `output` (any, required): The generated JSON +- `expected` (any, required): The expected JSON +- `string_scorer` (Scorer, optional): Scorer for string values (default: Levenshtein) +- `number_scorer` (Scorer, optional): Scorer for numeric values (default: NumericDiff) +- `preserve_strings` (boolean, optional): Don't auto-parse JSON strings (default: false) + +**Score Range:** 0-1 + +- `1.0` = JSON structures are identical +- `0.0` = JSON structures are completely different + +**Example:** + +```python +from autoevals.json import JSONDiff + +scorer = JSONDiff() +result = scorer.eval( + output={"name": "John", "age": 30}, + expected={"name": "John", "age": 31} +) +# Score: ~0.5 (name matches, age differs slightly) +``` + +### ValidJSON + +Validates JSON syntax and optionally checks against a JSON Schema. + +**Parameters:** + +- `output` (any, required): The value to validate +- `schema` (object, optional): JSON Schema to validate against + +**Score Range:** 0 or 1 + +- `1` = Valid JSON (and matches schema if provided) +- `0` = Invalid JSON or doesn't match schema + +**Example:** + +```typescript +import { ValidJSON } from "autoevals/json"; + +const schema = { + type: "object", + properties: { + name: { type: "string" }, + age: { type: "number" }, + }, + required: ["name", "age"], +}; + +const result = await ValidJSON({ + output: '{"name": "John", "age": 30}', + schema, +}); +// Score: 1 (valid JSON matching schema) +``` + +--- + +## List scorers + +Scorers for evaluating lists and arrays. + +### ListContains + +Checks if all expected items are present in the output list. + +**Parameters:** + +- `output` (any[], required): The generated list +- `expected` (any[], required): Items that should be present +- `scorer` (Scorer, optional): Scorer for comparing individual items + +**Score Range:** 0-1 + +- `1.0` = All expected items are present +- `0.0` = None of the expected items are present + +**Example:** + +```python +from autoevals.list import ListContains + +scorer = ListContains() +result = scorer.eval( + output=["apple", "banana", "cherry"], + expected=["apple", "banana"] +) +# Score: 1.0 (both expected items present) +``` + +--- + +## Custom scorers + +You can create custom scorers for domain-specific evaluation needs. See: + +- [JSON scorer examples](py/autoevals/json.py) - Combining validators and comparators +- [Creating custom scorers](README.md#creating-custom-scorers) - Basic custom scorer pattern + +--- + +## Score interpretation + +General guidelines for interpreting scores: + +- **1.0**: Perfect match or complete correctness +- **0.8-0.99**: Very good, minor differences +- **0.6-0.79**: Acceptable, some issues +- **0.4-0.59**: Moderate quality, significant issues +- **0.2-0.39**: Poor quality, major issues +- **0.0-0.19**: Unacceptable or completely wrong + +Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1. 
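+
+In practice, these bands are often collapsed into a simple pass/fail gate. A minimal sketch (the 0.6 threshold is illustrative, not a library default):
+
+```python
+from autoevals import Levenshtein
+
+result = Levenshtein()(output="hello world", expected="hello, world")
+
+# Fail the check when the score falls below the (illustrative) threshold
+assert result.score >= 0.6, f"Levenshtein score too low: {result.score}"
+```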
+ +--- + +## Common parameters + +Many scorers share these common parameters: + +- `model` (string): LLM model to use for evaluation (default: configured via `init()` or "gpt-4o") +- `client` (Client): Custom OpenAI-compatible client +- `use_cot` (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true) +- `temperature` (number): LLM temperature setting +- `max_tokens` (number): Maximum tokens for LLM response + +## Configuring defaults + +Use `init()` to configure default settings for all scorers: + +```typescript +import { init } from "autoevals"; +import OpenAI from "openai"; + +init({ + client: new OpenAI({ apiKey: "..." }), + defaultModel: "gpt-4o", +}); +``` + +```python +from autoevals import init +from openai import OpenAI + +init(OpenAI(api_key="..."), default_model="gpt-4o") +``` diff --git a/js/json.ts b/js/json.ts index 82cb646..50be17a 100644 --- a/js/json.ts +++ b/js/json.ts @@ -1,3 +1,73 @@ +/** + * JSON evaluation scorers for comparing and validating JSON data. + * + * This module provides scorers for working with JSON data: + * + * - **JSONDiff**: Compare JSON objects for structural and content similarity + * - **ValidJSON**: Validate if a value is valid JSON and matches an optional schema + * + * ## Creating Custom JSON Scorers + * + * You can create custom JSON scorers by composing existing scorers or building new ones: + * + * @example + * ```typescript + * import { Scorer } from "autoevals"; + * import { JSONDiff, ValidJSON } from "autoevals/json"; + * import { EmbeddingSimilarity } from "autoevals/string"; + * + * // Custom scorer that validates JSON schema then compares semantically + * const myJSONScorer: Scorer = async ({ output, expected, schema }) => { + * // First, validate both outputs against schema + * const outputValid = await ValidJSON({ output, schema }); + * const expectedValid = await ValidJSON({ output: expected, schema }); + * + * if (outputValid.score === 0 || expectedValid.score === 0) { + * return { + * name: "CustomJSONScorer", + * score: 0, + * error: "Invalid JSON format" + * }; + * } + * + * // Then compare using semantic similarity for strings + * return JSONDiff({ + * output, + * expected, + * stringScorer: EmbeddingSimilarity + * }); + * }; + * + * // Custom scorer for specific JSON structure validation + * const apiResponseScorer: Scorer = async ({ output }) => { + * const parsed = typeof output === "string" ? JSON.parse(output) : output; + * + * let score = 0; + * const errors: string[] = []; + * + * // Check required fields + * if (parsed.status) score += 0.3; + * else errors.push("Missing status field"); + * + * if (parsed.data) score += 0.3; + * else errors.push("Missing data field"); + * + * // Check data structure + * if (parsed.data?.items && Array.isArray(parsed.data.items)) { + * score += 0.4; + * } else { + * errors.push("data.items must be an array"); + * } + * + * return { + * name: "APIResponseScorer", + * score: Math.min(score, 1), + * metadata: { errors } + * }; + * }; + * ``` + */ + import { Scorer } from "./score"; import { NumericDiff } from "./number"; import { LevenshteinScorer } from "./string"; @@ -5,8 +75,48 @@ import Ajv, { JSONSchemaType, Schema } from "ajv"; import { makePartial, ScorerWithPartial } from "./partial"; /** - * A simple scorer that compares JSON objects, using a customizable comparison method for strings - * (defaults to Levenshtein) and numbers (defaults to NumericDiff). + * Compare JSON objects for structural and content similarity. 
+ * + * This scorer recursively compares JSON objects, handling: + * - Nested dictionaries and arrays + * - String similarity using Levenshtein distance (or custom scorer) + * - Numeric value comparison (or custom scorer) + * - Automatic parsing of JSON strings + * + * @example + * ```typescript + * import { JSONDiff } from "autoevals"; + * import { EmbeddingSimilarity } from "autoevals/string"; + * + * // Basic comparison + * const result = await JSONDiff({ + * output: { + * name: "John Smith", + * age: 30, + * skills: ["python", "javascript"] + * }, + * expected: { + * name: "John A. Smith", + * age: 31, + * skills: ["python", "typescript"] + * } + * }); + * console.log(result.score); // Similarity score between 0-1 + * + * // With custom string scorer using embeddings + * const semanticResult = await JSONDiff({ + * output: { description: "A fast car" }, + * expected: { description: "A quick automobile" }, + * stringScorer: EmbeddingSimilarity + * }); + * ``` + * + * @param output - The JSON object or string to evaluate + * @param expected - The expected JSON object or string to compare against + * @param stringScorer - Optional custom scorer for string comparisons (default: LevenshteinScorer) + * @param numberScorer - Optional custom scorer for number comparisons (default: NumericDiff) + * @param preserveStrings - Don't attempt to parse strings as JSON (default: false) + * @returns Score object with similarity score between 0-1 */ export const JSONDiff: ScorerWithPartial< any, @@ -38,8 +148,54 @@ export const JSONDiff: ScorerWithPartial< ); /** - * A binary scorer that evaluates the validity of JSON output, optionally validating against a - * JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create). + * Validate if a value is valid JSON and optionally matches a JSON Schema. + * + * This scorer checks if: + * - The input can be parsed as valid JSON (if it's a string) + * - The parsed JSON matches an optional JSON Schema + * - Handles both string inputs and pre-parsed JSON objects + * + * @example + * ```typescript + * import { ValidJSON } from "autoevals"; + * + * // Basic JSON validation + * const result1 = await ValidJSON({ + * output: '{"name": "John", "age": 30}' + * }); + * console.log(result1.score); // 1 (valid JSON) + * + * const result2 = await ValidJSON({ + * output: '{invalid json}' + * }); + * console.log(result2.score); // 0 (invalid JSON) + * + * // With schema validation + * const schema = { + * type: "object", + * properties: { + * name: { type: "string" }, + * age: { type: "number" } + * }, + * required: ["name", "age"] + * }; + * + * const result3 = await ValidJSON({ + * output: { name: "John", age: 30 }, + * schema + * }); + * console.log(result3.score); // 1 (matches schema) + * + * const result4 = await ValidJSON({ + * output: { name: "John" }, // missing required "age" + * schema + * }); + * console.log(result4.score); // 0 (doesn't match schema) + * ``` + * + * @param output - The value to validate (string or object) + * @param schema - Optional JSON Schema to validate against (see https://json-schema.org) + * @returns Score object with score of 1 if valid, 0 otherwise */ export const ValidJSON: ScorerWithPartial = makePartial( async ({ output, schema }) => { diff --git a/js/ragas.ts b/js/ragas.ts index d5a5285..a975ec0 100644 --- a/js/ragas.ts +++ b/js/ragas.ts @@ -1,4 +1,104 @@ -/*These metrics are ported, with some enhancements, from the [RAGAS](https://github.com/explodinggradients/ragas) project. 
*/ +/** + * RAGAS (Retrieval-Augmented Generation Assessment) metrics for evaluating RAG systems. + * + * These metrics are ported, with some enhancements, from the [RAGAS](https://github.com/explodinggradients/ragas) project. + * + * ## Context Quality Evaluators + * + * - **ContextEntityRecall**: Measures how well context contains expected entities + * - **ContextRelevancy**: Evaluates relevance of context to question + * - **ContextRecall**: Checks if context supports expected answer + * - **ContextPrecision**: Measures precision of context relative to question + * + * ## Answer Quality Evaluators + * + * - **Faithfulness**: Checks if answer claims are supported by context + * - **AnswerRelevancy**: Measures answer relevance to question + * - **AnswerSimilarity**: Compares semantic similarity to expected answer + * - **AnswerCorrectness**: Evaluates factual correctness against ground truth + * + * @example + * // Direct usage + * import { init } from "autoevals"; + * import { Faithfulness, ContextRelevancy } from "autoevals/ragas"; + * import OpenAI from "openai"; + * + * // Initialize with your OpenAI client + * init({ client: new OpenAI() }); + * + * // Evaluate context relevance + * const relevancyResult = await ContextRelevancy({ + * input: "What is the capital of France?", + * output: "Paris is the capital of France", + * context: [ + * "Paris is the capital of France.", + * "The city is known for the Eiffel Tower." + * ] + * }); + * console.log(relevancyResult.score); // 1.0 for highly relevant + * + * // Check answer faithfulness + * const faithfulnessResult = await Faithfulness({ + * input: "What is France's capital city?", + * output: "Paris is the capital of France and has the Eiffel Tower", + * context: [ + * "Paris is the capital of France.", + * "The city is known for the Eiffel Tower." + * ] + * }); + * console.log(faithfulnessResult.score); // 1.0 for fully supported + * + * @example + * // Using with Braintrust Eval + * import { Eval } from "braintrust"; + * import { init } from "autoevals"; + * import { Faithfulness, ContextRelevancy } from "autoevals/ragas"; + * import OpenAI from "openai"; + * + * // Initialize autoevals + * init({ client: new OpenAI() }); + * + * // Dataset with context in metadata + * const dataset = [ + * { + * input: "What is the capital of France?", + * expected: "Paris", + * metadata: { + * context: [ + * "Paris is the capital of France.", + * "Berlin is the capital of Germany." + * ] + * } + * }, + * // ... 
more examples + * ]; + * + * // Create scorer functions that extract context from metadata + * const faithfulnessScorer = ({ output, input, metadata }) => { + * return Faithfulness({ + * input, + * output, + * context: metadata.context || [] + * }); + * }; + * + * const contextRelevancyScorer = ({ output, input, metadata }) => { + * return ContextRelevancy({ + * input, + * output, + * context: metadata.context || [] + * }); + * }; + * + * // Run evaluation + * Eval("my-rag-eval", { + * data: () => dataset, + * task: (input) => generateAnswer(input), // Your LLM function + * scores: [faithfulnessScorer, contextRelevancyScorer] + * }); + * + * @module ragas + */ import mustache from "mustache"; import { Scorer, ScorerArgs } from "./score"; diff --git a/py/autoevals/json.py b/py/autoevals/json.py index 23147d0..a299e29 100644 --- a/py/autoevals/json.py +++ b/py/autoevals/json.py @@ -2,15 +2,85 @@ This module provides scorers for working with JSON data: -- JSONDiff: Compare JSON objects for structural and content similarity +- **JSONDiff**: Compare JSON objects for structural and content similarity - Handles nested structures, strings, numbers - Customizable with different scorers for string and number comparisons - Can automatically parse JSON strings -- ValidJSON: Validate if a string is valid JSON and matches an optional schema +- **ValidJSON**: Validate if a string is valid JSON and matches an optional schema - Validates JSON syntax - Optional JSON Schema validation - Works with both strings and parsed objects + +Creating Custom JSON Scorers +----------------------------- + +You can create custom JSON scorers by composing existing scorers or building new ones: + +Example 1: Combine schema validation with semantic comparison + ```python + from autoevals import Scorer, Score + from autoevals.json import JSONDiff, ValidJSON + from autoevals.string import EmbeddingSimilarity + from openai import OpenAI + + class CustomJSONScorer(Scorer): + def __init__(self, schema=None): + self.schema = schema + self.validator = ValidJSON(schema=schema) + self.differ = JSONDiff(string_scorer=EmbeddingSimilarity(client=OpenAI())) + + def _run_eval_sync(self, output, expected=None, **kwargs): + # First validate both against schema + output_valid = self.validator.eval(output) + expected_valid = self.validator.eval(expected) + + if output_valid.score == 0 or expected_valid.score == 0: + return Score( + name="CustomJSONScorer", + score=0, + error="Invalid JSON format" + ) + + # Then compare semantically + return self.differ.eval(output, expected) + ``` + +Example 2: Custom scorer for API response validation + ```python + from autoevals import Scorer, Score + import json + + class APIResponseScorer(Scorer): + def _run_eval_sync(self, output, **kwargs): + parsed = json.loads(output) if isinstance(output, str) else output + + score = 0 + errors = [] + + # Check required fields + if parsed.get("status"): + score += 0.3 + else: + errors.append("Missing status field") + + if parsed.get("data"): + score += 0.3 + else: + errors.append("Missing data field") + + # Check data structure + if isinstance(parsed.get("data", {}).get("items"), list): + score += 0.4 + else: + errors.append("data.items must be an array") + + return Score( + name="APIResponseScorer", + score=min(score, 1), + metadata={"errors": errors} + ) + ``` """ import json diff --git a/py/autoevals/ragas.py b/py/autoevals/ragas.py index 2e432fe..93950e6 100644 --- a/py/autoevals/ragas.py +++ b/py/autoevals/ragas.py @@ -20,7 +20,7 @@ - `model`: Model to use for 
evaluation, defaults to DEFAULT_RAGAS_MODEL (gpt-3.5-turbo-16k) - `client`: Optional Client for API calls. If not provided, uses global client from init() -**Example**: +**Example - Direct usage**: ```python from openai import OpenAI from autoevals import init @@ -37,7 +37,7 @@ result = relevancy.eval( input="What is the capital of France?", output="Paris is the capital of France", - context="Paris is the capital of France. The city is known for the Eiffel Tower." + context=["Paris is the capital of France.", "The city is known for the Eiffel Tower."] ) print(f"Context relevance score: {result.score}") # 1.0 for highly relevant @@ -46,11 +46,60 @@ result = faithfulness.eval( input="What is France's capital city?", output="Paris is the capital of France and has the Eiffel Tower", - context="Paris is the capital of France. The city is known for the Eiffel Tower." + context=["Paris is the capital of France.", "The city is known for the Eiffel Tower."] ) print(f"Faithfulness score: {result.score}") # 1.0 for fully supported claims ``` +**Example - Using with Braintrust Eval**: + ```python + from braintrust import Eval + from autoevals import init + from autoevals.ragas import Faithfulness, ContextRelevancy + from openai import OpenAI + + # Initialize autoevals + init(OpenAI()) + + # Dataset with context in metadata + dataset = [ + { + "input": "What is the capital of France?", + "expected": "Paris", + "metadata": { + "context": [ + "Paris is the capital of France.", + "Berlin is the capital of Germany." + ] + } + }, + # ... more examples + ] + + # Create scorer functions that extract context from metadata + def faithfulness_scorer(output, expected, input, metadata): + return Faithfulness().eval( + input=input, + output=output, + context=metadata.get("context", []) + ) + + def context_relevancy_scorer(output, expected, input, metadata): + return ContextRelevancy().eval( + input=input, + output=output, + context=metadata.get("context", []) + ) + + # Run evaluation + Eval( + "my-rag-eval", + data=dataset, + task=lambda input: generate_answer(input), # Your LLM function + scores=[faithfulness_scorer, context_relevancy_scorer] + ) + ``` + For more examples and detailed usage of each evaluator, see their individual class docstrings. """