diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..d17cb16 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,96 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +Autoevals is a dual-language library (TypeScript + Python) for evaluating AI model outputs. It provides LLM-as-a-judge evaluations, heuristic scorers (Levenshtein distance), and statistical metrics (BLEU). Developed by Braintrust. + +## Commands + +### TypeScript (in root directory) + +```bash +pnpm install # Install dependencies +pnpm run build # Build JS (outputs to jsdist/) +pnpm run test # Run all JS tests with vitest +pnpm run test -- js/llm.test.ts # Run single test file +pnpm run test -- -t "test name" # Run specific test by name +``` + +### Python (from root directory) + +```bash +make develop # Set up Python venv with all dependencies +source env.sh # Activate the venv +pytest # Run all Python tests +pytest py/autoevals/test_llm.py # Run single test file +pytest py/autoevals/test_llm.py::test_openai # Run specific test +pytest -k "test_name" # Run tests matching pattern +``` + +### Linting + +```bash +pre-commit run --all-files # Run all linters (black, ruff, prettier, codespell) +make fixup # Same as above +``` + +## Architecture + +### Dual Implementation Pattern + +The library maintains parallel implementations in TypeScript (`js/`) and Python (`py/autoevals/`). Both share: + +- The same evaluation templates (`templates/*.yaml`) +- The same `Score` interface: `{name, score (0-1), metadata}` +- The same scorer names and behavior + +### Key Modules (both languages) + +- `llm.ts` / `llm.py` - LLM-as-a-judge scorers (Factuality, Battle, ClosedQA, Humor, Security, Sql, Summary, Translation) +- `ragas.ts` / `ragas.py` - RAG evaluation metrics (ContextRelevancy, Faithfulness, AnswerRelevancy, etc.) +- `string.ts` / `string.py` - Text similarity (Levenshtein, EmbeddingSimilarity) +- `json.ts` / `json.py` - JSON validation and diff +- `oai.ts` / `oai.py` - OpenAI client wrapper with caching +- `score.ts` / `score.py` - Core Score type and Scorer base class + +### Template System + +YAML templates in `templates/` define LLM classifier prompts. Templates use Mustache syntax (`{{variable}}`). The `LLMClassifier` class loads these templates and handles: + +- Prompt rendering with chain-of-thought (CoT) suffix +- Tool-based response parsing via `select_choice` function +- Score mapping from choice letters to numeric scores + +### Python Scorer Pattern + +```python +class Scorer(ABC): + def eval(self, output, expected=None, **kwargs) -> Score # Sync + async def eval_async(self, output, expected=None, **kwargs) # Async + def __call__(...) 
# Alias for eval() +``` + +### TypeScript Scorer Pattern + +```typescript +type Scorer = ( + args: ScorerArgs, +) => Score | Promise; +// All scorers are async functions +``` + +## Environment Variables + +Tests require: + +- `OPENAI_API_KEY` or `BRAINTRUST_API_KEY` - For LLM-based evaluations +- `OPENAI_BASE_URL` (optional) - Custom API endpoint + +## Testing Notes + +- Python tests use `pytest` with `respx` for HTTP mocking +- TypeScript tests use `vitest` with `msw` for HTTP mocking +- Tests that call real LLM APIs need valid API keys +- Test files are colocated: `test_*.py` (Python), `*.test.ts` (TypeScript) diff --git a/README.md b/README.md index fb5e907..a5eda5d 100644 --- a/README.md +++ b/README.md @@ -361,6 +361,8 @@ Eval( - Numeric difference - JSON diff +For detailed documentation on all scorers, including parameters, score ranges, and usage examples, see the [**Scorer Reference**](SCORERS.md). + ## Custom evaluation prompts Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism: diff --git a/SCORERS.md b/SCORERS.md new file mode 100644 index 0000000..9515142 --- /dev/null +++ b/SCORERS.md @@ -0,0 +1,628 @@ +# Autoevals scorer reference + +Complete reference for all scorers available in Autoevals, including parameters, score ranges, and usage examples. + +## Table of contents + +- [LLM-as-a-judge scorers](#llm-as-a-judge-scorers) +- [RAG (Retrieval-Augmented Generation) scorers](#rag-retrieval-augmented-generation-scorers) +- [Heuristic scorers](#heuristic-scorers) +- [JSON scorers](#json-scorers) +- [List scorers](#list-scorers) + +--- + +## LLM-as-a-judge scorers + +These scorers use language models to evaluate outputs based on semantic understanding. + +### Factuality + +Evaluates whether the output is factually consistent with the expected answer. + +**Parameters:** + +- `input` (string): The input question or prompt +- `output` (string, required): The generated answer to evaluate +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Model to use (default: configured via `init()` or "gpt-4o") +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Output is factually accurate +- `0.0` = Output contains factual errors + +**Example:** + +```typescript +import { Factuality } from "autoevals"; + +const result = await Factuality({ + input: "What is the capital of France?", + output: "Paris", + expected: "The capital of France is Paris", +}); +// Score: 1.0 (factually correct) +``` + +### Battle + +Compares two outputs and determines which one is better. + +**Parameters:** + +- `input` (string): The input question or prompt +- `output` (string, required): First answer to compare +- `expected` (string, required): Second answer to compare +- `model` (string, optional): Model to use +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Output is significantly better than expected +- `0.5` = Both outputs are roughly equal +- `0.0` = Expected is significantly better than output + +**Example:** + +```python +from autoevals.llm import Battle + +evaluator = Battle() +result = evaluator.eval( + input="Explain photosynthesis", + output="Plants use sunlight to make food from CO2 and water", + expected="Photosynthesis is a process" +) +# Score: ~1.0 (first answer is more detailed) +``` + +### ClosedQA + +Evaluates answers to closed-ended questions where there's a clear correct answer. 
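+
+A minimal usage sketch (the `criteria` string below is illustrative):
+
+```python
+from autoevals.llm import ClosedQA
+
+evaluator = ClosedQA()
+result = evaluator.eval(
+    input="What is 2 + 2?",
+    output="4",
+    criteria="The answer must be arithmetically correct.",
+)
+print(result.score)  # 1.0 when the answer satisfies the criteria
+```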
+ +**Parameters:** + +- `input` (string): The question +- `output` (string, required): The generated answer +- `expected` (string, required): The correct answer +- `model` (string, optional): Model to use +- `criteria` (string, optional): Custom evaluation criteria + +**Score Range:** 0-1 + +- `1.0` = Answer is correct +- `0.0` = Answer is incorrect + +### Humor + +Evaluates whether the output is humorous. + +**Parameters:** + +- `input` (string): The context or setup +- `output` (string, required): The text to evaluate for humor +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Very humorous +- `0.0` = Not humorous + +### Security + +Evaluates whether the output contains security vulnerabilities or unsafe content. + +**Parameters:** + +- `output` (string, required): The content to evaluate +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = No security concerns +- `0.0` = Contains security vulnerabilities + +### Moderation + +Evaluates content for policy violations using OpenAI's moderation API. + +**Parameters:** + +- `output` (string, required): The content to moderate +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** 0-1 + +- `1.0` = Content is safe +- `0.0` = Content violates policies + +**Categories Checked:** + +- Sexual content +- Hate speech +- Harassment +- Self-harm +- Violence +- Sexual content involving minors + +### Sql + +Evaluates SQL query correctness and quality. + +**Parameters:** + +- `input` (string): The natural language question +- `output` (string, required): The generated SQL query +- `expected` (string, optional): The correct SQL query +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +### Summary + +Evaluates the quality of text summaries. + +**Parameters:** + +- `input` (string): The original text +- `output` (string, required): The generated summary +- `expected` (string, optional): A reference summary +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Excellent summary (accurate, concise, complete) +- `0.0` = Poor summary + +### Translation + +Evaluates translation quality. + +**Parameters:** + +- `input` (string): The source text +- `output` (string, required): The generated translation +- `expected` (string, optional): A reference translation +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Excellent translation +- `0.0` = Poor translation + +--- + +## RAG (Retrieval-Augmented Generation) scorers + +These scorers evaluate RAG systems by assessing both context retrieval and answer generation quality. + +All RAG scorers support passing `context` through the `metadata` parameter when used with Braintrust Eval. See the [RAGAS module documentation](js/ragas.ts) for examples. + +### ContextRelevancy + +Evaluates how relevant the retrieved context is to the input question. + +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use (default: "gpt-4o-mini") + +**Score Range:** 0-1 + +- `1.0` = All context is highly relevant +- `0.0` = Context is irrelevant + +**Example:** + +```python +from autoevals.ragas import ContextRelevancy + +scorer = ContextRelevancy() +result = scorer.eval( + input="What is the capital of France?", + output="Paris", + context=[ + "Paris is the capital of France.", + "Berlin is the capital of Germany." 
+ ] +) +# Score: ~0.5 (only first context item is relevant) +``` + +### ContextRecall + +Measures how well the context supports the expected answer. + +**Parameters:** + +- `input` (string): The question +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = Context fully supports the expected answer +- `0.0` = Context doesn't support the expected answer + +### ContextPrecision + +Measures the precision of retrieved context - whether relevant context appears before irrelevant context. + +**Parameters:** + +- `input` (string, required): The question +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages (order matters) +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All relevant context appears first +- `0.0` = Relevant context is buried under irrelevant context + +### ContextEntityRecall + +Measures how well the context contains entities from the expected answer. + +**Parameters:** + +- `expected` (string, required): The ground truth answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All entities from expected answer are in context +- `0.0` = No entities from expected answer are in context + +### Faithfulness + +Evaluates whether the generated answer's claims are supported by the context. + +**Parameters:** + +- `input` (string): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, required): Retrieved context passages +- `model` (string, optional): Model to use + +**Score Range:** 0-1 + +- `1.0` = All claims in the answer are supported by context +- `0.0` = Answer contains unsupported claims (hallucinations) + +**Example:** + +```typescript +import { Faithfulness } from "autoevals/ragas"; + +const result = await Faithfulness({ + input: "What is photosynthesis?", + output: + "Photosynthesis is how plants make food using sunlight and also they can fly.", + context: [ + "Photosynthesis is the process by which plants use sunlight to synthesize foods.", + ], +}); +// Score: ~0.5 (first claim supported, "can fly" is not) +``` + +### AnswerRelevancy + +Measures how relevant the answer is to the question. + +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `context` (string[] | string, optional): Retrieved context passages +- `model` (string, optional): Model to use +- `embedding_model` (string, optional): Model for embeddings (default: "text-embedding-3-small") + +**Score Range:** 0-1 + +- `1.0` = Answer directly addresses the question +- `0.0` = Answer is off-topic + +### AnswerSimilarity + +Compares semantic similarity between the generated answer and expected answer using embeddings. + +**Parameters:** + +- `output` (string, required): The generated answer +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Embedding model to use (default: "text-embedding-3-small") + +**Score Range:** 0-1 + +- `1.0` = Answers are semantically identical +- `0.0` = Answers are completely different + +### AnswerCorrectness + +Combines factual correctness and semantic similarity to evaluate answers. 
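+
+A minimal usage sketch (the final score blends an LLM factuality check with embedding similarity, using the weights listed below):
+
+```python
+from autoevals.ragas import AnswerCorrectness
+
+scorer = AnswerCorrectness()
+result = scorer.eval(
+    input="What is the capital of France?",
+    output="Paris is the capital of France.",
+    expected="Paris",
+)
+print(result.score)  # Near 1.0 for a factually correct, semantically similar answer
+```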
+ +**Parameters:** + +- `input` (string, required): The question +- `output` (string, required): The generated answer +- `expected` (string, required): The ground truth answer +- `model` (string, optional): Model for factuality checking +- `embedding_model` (string, optional): Model for similarity (default: "text-embedding-3-small") +- `factuality_weight` (number, optional): Weight for factuality (default: 0.75) +- `answer_similarity_weight` (number, optional): Weight for similarity (default: 0.25) + +**Score Range:** 0-1 + +**Formula:** `score = (factuality_weight × factuality_score + answer_similarity_weight × similarity_score) / (factuality_weight + answer_similarity_weight)` + +--- + +## Heuristic scorers + +Fast, deterministic scorers that don't use LLMs. + +### Levenshtein + +Calculates Levenshtein (edit) distance between strings, normalized to 0-1. + +**Parameters:** + +- `output` (string, required): The generated text +- `expected` (string, required): The expected text + +**Score Range:** 0-1 + +- `1.0` = Strings are identical +- `0.0` = Strings are completely different + +**Example:** + +```python +from autoevals.string import Levenshtein + +scorer = Levenshtein() +result = scorer.eval(output="hello", expected="helo") +# Score: ~0.8 (1 character difference) +``` + +### ExactMatch + +Binary scorer that checks for exact string equality. + +**Parameters:** + +- `output` (any, required): The generated value +- `expected` (any, required): The expected value + +**Score Range:** 0 or 1 + +- `1` = Values are exactly equal +- `0` = Values differ + +### NumericDiff + +Evaluates numeric differences with configurable thresholds. + +**Parameters:** + +- `output` (number, required): The generated number +- `expected` (number, required): The expected number +- `max_diff` (number, optional): Maximum acceptable difference (default: 0) +- `relative` (boolean, optional): Use relative difference (default: false) + +**Score Range:** 0-1 + +- `1.0` = Numbers are equal (within threshold) +- `0.0` = Numbers differ significantly + +**Formula (absolute):** `score = max(0, 1 - |output - expected| / max_diff)` (when max_diff > 0) + +**Formula (relative):** `score = max(0, 1 - |output - expected| / |expected|)` + +**Example:** + +```typescript +import { NumericDiff } from "autoevals"; + +// Absolute difference +const result1 = await NumericDiff({ + output: 10.5, + expected: 10.0, + maxDiff: 1.0, +}); +// Score: 0.5 (difference of 0.5 out of max 1.0) + +// Relative difference +const result2 = await NumericDiff({ + output: 100, + expected: 110, + relative: true, +}); +// Score: ~0.91 (10% difference) +``` + +### EmbeddingSimilarity + +Compares semantic similarity using text embeddings (cosine similarity). + +**Parameters:** + +- `output` (string, required): The generated text +- `expected` (string, required): The expected text +- `model` (string, optional): Embedding model (default: "text-embedding-3-small") +- `client` (Client, optional): Custom OpenAI client + +**Score Range:** -1 to 1 (typically 0-1 for text) + +- `1.0` = Semantically identical +- `0.0` = Unrelated +- `-1.0` = Opposite meanings (rare) + +--- + +## JSON scorers + +Scorers for evaluating JSON outputs. + +### JSONDiff + +Recursively compares JSON objects with customizable string and number comparison. 
+ +**Parameters:** + +- `output` (any, required): The generated JSON +- `expected` (any, required): The expected JSON +- `string_scorer` (Scorer, optional): Scorer for string values (default: Levenshtein) +- `number_scorer` (Scorer, optional): Scorer for numeric values (default: NumericDiff) +- `preserve_strings` (boolean, optional): Don't auto-parse JSON strings (default: false) + +**Score Range:** 0-1 + +- `1.0` = JSON structures are identical +- `0.0` = JSON structures are completely different + +**Example:** + +```python +from autoevals.json import JSONDiff + +scorer = JSONDiff() +result = scorer.eval( + output={"name": "John", "age": 30}, + expected={"name": "John", "age": 31} +) +# Score: ~0.5 (name matches, age differs slightly) +``` + +### ValidJSON + +Validates JSON syntax and optionally checks against a JSON Schema. + +**Parameters:** + +- `output` (any, required): The value to validate +- `schema` (object, optional): JSON Schema to validate against + +**Score Range:** 0 or 1 + +- `1` = Valid JSON (and matches schema if provided) +- `0` = Invalid JSON or doesn't match schema + +**Example:** + +```typescript +import { ValidJSON } from "autoevals/json"; + +const schema = { + type: "object", + properties: { + name: { type: "string" }, + age: { type: "number" }, + }, + required: ["name", "age"], +}; + +const result = await ValidJSON({ + output: '{"name": "John", "age": 30}', + schema, +}); +// Score: 1 (valid JSON matching schema) +``` + +--- + +## List scorers + +Scorers for evaluating lists and arrays. + +### ListContains + +Checks if all expected items are present in the output list. + +**Parameters:** + +- `output` (any[], required): The generated list +- `expected` (any[], required): Items that should be present +- `scorer` (Scorer, optional): Scorer for comparing individual items + +**Score Range:** 0-1 + +- `1.0` = All expected items are present +- `0.0` = None of the expected items are present + +**Example:** + +```python +from autoevals.list import ListContains + +scorer = ListContains() +result = scorer.eval( + output=["apple", "banana", "cherry"], + expected=["apple", "banana"] +) +# Score: 1.0 (both expected items present) +``` + +--- + +## Custom scorers + +You can create custom scorers for domain-specific evaluation needs. See: + +- [JSON scorer examples](py/autoevals/json.py) - Combining validators and comparators +- [Creating custom scorers](README.md#creating-custom-scorers) - Basic custom scorer pattern + +--- + +## Score interpretation + +General guidelines for interpreting scores: + +- **1.0**: Perfect match or complete correctness +- **0.8-0.99**: Very good, minor differences +- **0.6-0.79**: Acceptable, some issues +- **0.4-0.59**: Moderate quality, significant issues +- **0.2-0.39**: Poor quality, major issues +- **0.0-0.19**: Unacceptable or completely wrong + +Note: Interpretation varies by scorer type. Binary scorers (ExactMatch, ValidJSON) only return 0 or 1. 
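+
+In practice, these bands are often collapsed into a simple pass/fail gate. A minimal sketch (the 0.6 threshold is illustrative, not a library default):
+
+```python
+from autoevals import Levenshtein
+
+result = Levenshtein()(output="hello world", expected="hello, world")
+
+# Fail the check when the score falls below the (illustrative) threshold
+assert result.score >= 0.6, f"Levenshtein score too low: {result.score}"
+```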
+ +--- + +## Common parameters + +Many scorers share these common parameters: + +- `model` (string): LLM model to use for evaluation (default: configured via `init()` or "gpt-4o") +- `client` (Client): Custom OpenAI-compatible client +- `use_cot` (boolean): Enable chain-of-thought reasoning for LLM scorers (default: true) +- `temperature` (number): LLM temperature setting +- `max_tokens` (number): Maximum tokens for LLM response + +## Configuring defaults + +Use `init()` to configure default settings for all scorers: + +```typescript +import { init } from "autoevals"; +import OpenAI from "openai"; + +init({ + client: new OpenAI({ apiKey: "..." }), + defaultModel: "gpt-4o", +}); +``` + +```python +from autoevals import init +from openai import OpenAI + +init(OpenAI(api_key="..."), default_model="gpt-4o") +``` diff --git a/js/json.ts b/js/json.ts index 82cb646..50be17a 100644 --- a/js/json.ts +++ b/js/json.ts @@ -1,3 +1,73 @@ +/** + * JSON evaluation scorers for comparing and validating JSON data. + * + * This module provides scorers for working with JSON data: + * + * - **JSONDiff**: Compare JSON objects for structural and content similarity + * - **ValidJSON**: Validate if a value is valid JSON and matches an optional schema + * + * ## Creating Custom JSON Scorers + * + * You can create custom JSON scorers by composing existing scorers or building new ones: + * + * @example + * ```typescript + * import { Scorer } from "autoevals"; + * import { JSONDiff, ValidJSON } from "autoevals/json"; + * import { EmbeddingSimilarity } from "autoevals/string"; + * + * // Custom scorer that validates JSON schema then compares semantically + * const myJSONScorer: Scorer = async ({ output, expected, schema }) => { + * // First, validate both outputs against schema + * const outputValid = await ValidJSON({ output, schema }); + * const expectedValid = await ValidJSON({ output: expected, schema }); + * + * if (outputValid.score === 0 || expectedValid.score === 0) { + * return { + * name: "CustomJSONScorer", + * score: 0, + * error: "Invalid JSON format" + * }; + * } + * + * // Then compare using semantic similarity for strings + * return JSONDiff({ + * output, + * expected, + * stringScorer: EmbeddingSimilarity + * }); + * }; + * + * // Custom scorer for specific JSON structure validation + * const apiResponseScorer: Scorer = async ({ output }) => { + * const parsed = typeof output === "string" ? JSON.parse(output) : output; + * + * let score = 0; + * const errors: string[] = []; + * + * // Check required fields + * if (parsed.status) score += 0.3; + * else errors.push("Missing status field"); + * + * if (parsed.data) score += 0.3; + * else errors.push("Missing data field"); + * + * // Check data structure + * if (parsed.data?.items && Array.isArray(parsed.data.items)) { + * score += 0.4; + * } else { + * errors.push("data.items must be an array"); + * } + * + * return { + * name: "APIResponseScorer", + * score: Math.min(score, 1), + * metadata: { errors } + * }; + * }; + * ``` + */ + import { Scorer } from "./score"; import { NumericDiff } from "./number"; import { LevenshteinScorer } from "./string"; @@ -5,8 +75,48 @@ import Ajv, { JSONSchemaType, Schema } from "ajv"; import { makePartial, ScorerWithPartial } from "./partial"; /** - * A simple scorer that compares JSON objects, using a customizable comparison method for strings - * (defaults to Levenshtein) and numbers (defaults to NumericDiff). + * Compare JSON objects for structural and content similarity. 
+ * + * This scorer recursively compares JSON objects, handling: + * - Nested dictionaries and arrays + * - String similarity using Levenshtein distance (or custom scorer) + * - Numeric value comparison (or custom scorer) + * - Automatic parsing of JSON strings + * + * @example + * ```typescript + * import { JSONDiff } from "autoevals"; + * import { EmbeddingSimilarity } from "autoevals/string"; + * + * // Basic comparison + * const result = await JSONDiff({ + * output: { + * name: "John Smith", + * age: 30, + * skills: ["python", "javascript"] + * }, + * expected: { + * name: "John A. Smith", + * age: 31, + * skills: ["python", "typescript"] + * } + * }); + * console.log(result.score); // Similarity score between 0-1 + * + * // With custom string scorer using embeddings + * const semanticResult = await JSONDiff({ + * output: { description: "A fast car" }, + * expected: { description: "A quick automobile" }, + * stringScorer: EmbeddingSimilarity + * }); + * ``` + * + * @param output - The JSON object or string to evaluate + * @param expected - The expected JSON object or string to compare against + * @param stringScorer - Optional custom scorer for string comparisons (default: LevenshteinScorer) + * @param numberScorer - Optional custom scorer for number comparisons (default: NumericDiff) + * @param preserveStrings - Don't attempt to parse strings as JSON (default: false) + * @returns Score object with similarity score between 0-1 */ export const JSONDiff: ScorerWithPartial< any, @@ -38,8 +148,54 @@ export const JSONDiff: ScorerWithPartial< ); /** - * A binary scorer that evaluates the validity of JSON output, optionally validating against a - * JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create). + * Validate if a value is valid JSON and optionally matches a JSON Schema. + * + * This scorer checks if: + * - The input can be parsed as valid JSON (if it's a string) + * - The parsed JSON matches an optional JSON Schema + * - Handles both string inputs and pre-parsed JSON objects + * + * @example + * ```typescript + * import { ValidJSON } from "autoevals"; + * + * // Basic JSON validation + * const result1 = await ValidJSON({ + * output: '{"name": "John", "age": 30}' + * }); + * console.log(result1.score); // 1 (valid JSON) + * + * const result2 = await ValidJSON({ + * output: '{invalid json}' + * }); + * console.log(result2.score); // 0 (invalid JSON) + * + * // With schema validation + * const schema = { + * type: "object", + * properties: { + * name: { type: "string" }, + * age: { type: "number" } + * }, + * required: ["name", "age"] + * }; + * + * const result3 = await ValidJSON({ + * output: { name: "John", age: 30 }, + * schema + * }); + * console.log(result3.score); // 1 (matches schema) + * + * const result4 = await ValidJSON({ + * output: { name: "John" }, // missing required "age" + * schema + * }); + * console.log(result4.score); // 0 (doesn't match schema) + * ``` + * + * @param output - The value to validate (string or object) + * @param schema - Optional JSON Schema to validate against (see https://json-schema.org) + * @returns Score object with score of 1 if valid, 0 otherwise */ export const ValidJSON: ScorerWithPartial = makePartial( async ({ output, schema }) => { diff --git a/js/ragas.ts b/js/ragas.ts index d5a5285..a975ec0 100644 --- a/js/ragas.ts +++ b/js/ragas.ts @@ -1,4 +1,104 @@ -/*These metrics are ported, with some enhancements, from the [RAGAS](https://github.com/explodinggradients/ragas) project. 
*/ +/** + * RAGAS (Retrieval-Augmented Generation Assessment) metrics for evaluating RAG systems. + * + * These metrics are ported, with some enhancements, from the [RAGAS](https://github.com/explodinggradients/ragas) project. + * + * ## Context Quality Evaluators + * + * - **ContextEntityRecall**: Measures how well context contains expected entities + * - **ContextRelevancy**: Evaluates relevance of context to question + * - **ContextRecall**: Checks if context supports expected answer + * - **ContextPrecision**: Measures precision of context relative to question + * + * ## Answer Quality Evaluators + * + * - **Faithfulness**: Checks if answer claims are supported by context + * - **AnswerRelevancy**: Measures answer relevance to question + * - **AnswerSimilarity**: Compares semantic similarity to expected answer + * - **AnswerCorrectness**: Evaluates factual correctness against ground truth + * + * @example + * // Direct usage + * import { init } from "autoevals"; + * import { Faithfulness, ContextRelevancy } from "autoevals/ragas"; + * import OpenAI from "openai"; + * + * // Initialize with your OpenAI client + * init({ client: new OpenAI() }); + * + * // Evaluate context relevance + * const relevancyResult = await ContextRelevancy({ + * input: "What is the capital of France?", + * output: "Paris is the capital of France", + * context: [ + * "Paris is the capital of France.", + * "The city is known for the Eiffel Tower." + * ] + * }); + * console.log(relevancyResult.score); // 1.0 for highly relevant + * + * // Check answer faithfulness + * const faithfulnessResult = await Faithfulness({ + * input: "What is France's capital city?", + * output: "Paris is the capital of France and has the Eiffel Tower", + * context: [ + * "Paris is the capital of France.", + * "The city is known for the Eiffel Tower." + * ] + * }); + * console.log(faithfulnessResult.score); // 1.0 for fully supported + * + * @example + * // Using with Braintrust Eval + * import { Eval } from "braintrust"; + * import { init } from "autoevals"; + * import { Faithfulness, ContextRelevancy } from "autoevals/ragas"; + * import OpenAI from "openai"; + * + * // Initialize autoevals + * init({ client: new OpenAI() }); + * + * // Dataset with context in metadata + * const dataset = [ + * { + * input: "What is the capital of France?", + * expected: "Paris", + * metadata: { + * context: [ + * "Paris is the capital of France.", + * "Berlin is the capital of Germany." + * ] + * } + * }, + * // ... 
more examples + * ]; + * + * // Create scorer functions that extract context from metadata + * const faithfulnessScorer = ({ output, input, metadata }) => { + * return Faithfulness({ + * input, + * output, + * context: metadata.context || [] + * }); + * }; + * + * const contextRelevancyScorer = ({ output, input, metadata }) => { + * return ContextRelevancy({ + * input, + * output, + * context: metadata.context || [] + * }); + * }; + * + * // Run evaluation + * Eval("my-rag-eval", { + * data: () => dataset, + * task: (input) => generateAnswer(input), // Your LLM function + * scores: [faithfulnessScorer, contextRelevancyScorer] + * }); + * + * @module ragas + */ import mustache from "mustache"; import { Scorer, ScorerArgs } from "./score"; diff --git a/py/autoevals/json.py b/py/autoevals/json.py index 23147d0..a299e29 100644 --- a/py/autoevals/json.py +++ b/py/autoevals/json.py @@ -2,15 +2,85 @@ This module provides scorers for working with JSON data: -- JSONDiff: Compare JSON objects for structural and content similarity +- **JSONDiff**: Compare JSON objects for structural and content similarity - Handles nested structures, strings, numbers - Customizable with different scorers for string and number comparisons - Can automatically parse JSON strings -- ValidJSON: Validate if a string is valid JSON and matches an optional schema +- **ValidJSON**: Validate if a string is valid JSON and matches an optional schema - Validates JSON syntax - Optional JSON Schema validation - Works with both strings and parsed objects + +Creating Custom JSON Scorers +----------------------------- + +You can create custom JSON scorers by composing existing scorers or building new ones: + +Example 1: Combine schema validation with semantic comparison + ```python + from autoevals import Scorer, Score + from autoevals.json import JSONDiff, ValidJSON + from autoevals.string import EmbeddingSimilarity + from openai import OpenAI + + class CustomJSONScorer(Scorer): + def __init__(self, schema=None): + self.schema = schema + self.validator = ValidJSON(schema=schema) + self.differ = JSONDiff(string_scorer=EmbeddingSimilarity(client=OpenAI())) + + def _run_eval_sync(self, output, expected=None, **kwargs): + # First validate both against schema + output_valid = self.validator.eval(output) + expected_valid = self.validator.eval(expected) + + if output_valid.score == 0 or expected_valid.score == 0: + return Score( + name="CustomJSONScorer", + score=0, + error="Invalid JSON format" + ) + + # Then compare semantically + return self.differ.eval(output, expected) + ``` + +Example 2: Custom scorer for API response validation + ```python + from autoevals import Scorer, Score + import json + + class APIResponseScorer(Scorer): + def _run_eval_sync(self, output, **kwargs): + parsed = json.loads(output) if isinstance(output, str) else output + + score = 0 + errors = [] + + # Check required fields + if parsed.get("status"): + score += 0.3 + else: + errors.append("Missing status field") + + if parsed.get("data"): + score += 0.3 + else: + errors.append("Missing data field") + + # Check data structure + if isinstance(parsed.get("data", {}).get("items"), list): + score += 0.4 + else: + errors.append("data.items must be an array") + + return Score( + name="APIResponseScorer", + score=min(score, 1), + metadata={"errors": errors} + ) + ``` """ import json diff --git a/py/autoevals/ragas.py b/py/autoevals/ragas.py index 2e432fe..93950e6 100644 --- a/py/autoevals/ragas.py +++ b/py/autoevals/ragas.py @@ -20,7 +20,7 @@ - `model`: Model to use for 
evaluation, defaults to DEFAULT_RAGAS_MODEL (gpt-3.5-turbo-16k) - `client`: Optional Client for API calls. If not provided, uses global client from init() -**Example**: +**Example - Direct usage**: ```python from openai import OpenAI from autoevals import init @@ -37,7 +37,7 @@ result = relevancy.eval( input="What is the capital of France?", output="Paris is the capital of France", - context="Paris is the capital of France. The city is known for the Eiffel Tower." + context=["Paris is the capital of France.", "The city is known for the Eiffel Tower."] ) print(f"Context relevance score: {result.score}") # 1.0 for highly relevant @@ -46,11 +46,60 @@ result = faithfulness.eval( input="What is France's capital city?", output="Paris is the capital of France and has the Eiffel Tower", - context="Paris is the capital of France. The city is known for the Eiffel Tower." + context=["Paris is the capital of France.", "The city is known for the Eiffel Tower."] ) print(f"Faithfulness score: {result.score}") # 1.0 for fully supported claims ``` +**Example - Using with Braintrust Eval**: + ```python + from braintrust import Eval + from autoevals import init + from autoevals.ragas import Faithfulness, ContextRelevancy + from openai import OpenAI + + # Initialize autoevals + init(OpenAI()) + + # Dataset with context in metadata + dataset = [ + { + "input": "What is the capital of France?", + "expected": "Paris", + "metadata": { + "context": [ + "Paris is the capital of France.", + "Berlin is the capital of Germany." + ] + } + }, + # ... more examples + ] + + # Create scorer functions that extract context from metadata + def faithfulness_scorer(output, expected, input, metadata): + return Faithfulness().eval( + input=input, + output=output, + context=metadata.get("context", []) + ) + + def context_relevancy_scorer(output, expected, input, metadata): + return ContextRelevancy().eval( + input=input, + output=output, + context=metadata.get("context", []) + ) + + # Run evaluation + Eval( + "my-rag-eval", + data=dataset, + task=lambda input: generate_answer(input), # Your LLM function + scores=[faithfulness_scorer, context_relevancy_scorer] + ) + ``` + For more examples and detailed usage of each evaluator, see their individual class docstrings. """