96 changes: 96 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,96 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Autoevals is a dual-language library (TypeScript + Python) for evaluating AI model outputs, developed by Braintrust. It provides LLM-as-a-judge evaluations, heuristic scorers (e.g., Levenshtein distance), and statistical metrics (e.g., BLEU).

## Commands

### TypeScript (in root directory)

```bash
pnpm install # Install dependencies
pnpm run build # Build JS (outputs to jsdist/)
pnpm run test # Run all JS tests with vitest
pnpm run test -- js/llm.test.ts # Run single test file
pnpm run test -- -t "test name" # Run specific test by name
```

### Python (from root directory)

```bash
make develop # Set up Python venv with all dependencies
source env.sh # Activate the venv
pytest # Run all Python tests
pytest py/autoevals/test_llm.py # Run single test file
pytest py/autoevals/test_llm.py::test_openai # Run specific test
pytest -k "test_name" # Run tests matching pattern
```

### Linting

```bash
pre-commit run --all-files # Run all linters (black, ruff, prettier, codespell)
make fixup # Same as above
```

## Architecture

### Dual Implementation Pattern

The library maintains parallel implementations in TypeScript (`js/`) and Python (`py/autoevals/`). Both share:

- The same evaluation templates (`templates/*.yaml`)
- The same `Score` interface: `{name, score (0-1), metadata}`
- The same scorer names and behavior
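
A minimal sketch of that shared shape on the Python side (field names follow the interface above; the canonical definition lives in `py/autoevals/score.py`, and the exact types here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Illustrative sketch of the shared Score shape; see py/autoevals/score.py
# (and js/score.ts) for the canonical definitions.
@dataclass
class Score:
    name: str                                   # Scorer name, e.g. "Factuality"
    score: Optional[float] = None               # Normalized to the 0-1 range
    metadata: dict[str, Any] = field(default_factory=dict)  # Rationale, choice, etc.
```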

### Key Modules (both languages)

- `llm.ts` / `llm.py` - LLM-as-a-judge scorers (Factuality, Battle, ClosedQA, Humor, Security, Sql, Summary, Translation)
- `ragas.ts` / `ragas.py` - RAG evaluation metrics (ContextRelevancy, Faithfulness, AnswerRelevancy, etc.)
- `string.ts` / `string.py` - Text similarity (Levenshtein, EmbeddingSimilarity)
- `json.ts` / `json.py` - JSON validation and diff
- `oai.ts` / `oai.py` - OpenAI client wrapper with caching
- `score.ts` / `score.py` - Core Score type and Scorer base class
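
A typical call exercising two of these modules (a sketch based on the public Python API; it assumes `OPENAI_API_KEY` is set for the LLM-based scorer):

```python
from autoevals.llm import Factuality
from autoevals.string import Levenshtein

# LLM-as-a-judge: grades `output` against `expected` for factual consistency.
factuality = Factuality()
result = factuality(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.name, result.score, result.metadata)

# Heuristic: normalized edit distance, no API key required.
print(Levenshtein()(output="color", expected="colour").score)
```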

### Template System

YAML templates in `templates/` define LLM classifier prompts. Templates use Mustache syntax (`{{variable}}`). The `LLMClassifier` class loads these templates and handles:

- Prompt rendering with chain-of-thought (CoT) suffix
- Tool-based response parsing via `select_choice` function
- Score mapping from choice letters to numeric scores
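
The same machinery can be driven directly for custom classifiers. A hedged sketch (the prompt text and choice mapping below are illustrative, not taken from `templates/`):

```python
from autoevals import LLMClassifier

# Illustrative custom classifier; the built-in scorers load equivalent
# definitions from templates/*.yaml.
relevance = LLMClassifier(
    name="Relevance",
    prompt_template=(
        "Is the answer relevant to the question?\n"
        "Question: {{input}}\nAnswer: {{output}}\n"
        "(A) relevant, (B) not relevant"
    ),
    choice_scores={"A": 1, "B": 0},  # Choice letters map to numeric scores
    use_cot=True,                    # Appends the chain-of-thought suffix
)
score = relevance(output="Paris", input="What is the capital of France?")
```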

### Python Scorer Pattern

```python
class Scorer(ABC):
    def eval(self, output, expected=None, **kwargs) -> Score: ...        # Sync
    async def eval_async(self, output, expected=None, **kwargs) -> Score: ...  # Async
    def __call__(self, *args, **kwargs) -> Score: ...                    # Alias for eval()
```

### TypeScript Scorer Pattern

```typescript
type Scorer<Output, Extra> = (
  args: ScorerArgs<Output, Extra>,
) => Score | Promise<Score>;
// All scorers are async functions
```

## Environment Variables

Tests require:

- `OPENAI_API_KEY` or `BRAINTRUST_API_KEY` - For LLM-based evaluations
- `OPENAI_BASE_URL` (optional) - Custom API endpoint

## Testing Notes

- Python tests use `pytest` with `respx` for HTTP mocking
- TypeScript tests use `vitest` with `msw` for HTTP mocking
- Tests that call real LLM APIs need valid API keys
- Test files are colocated: `test_*.py` (Python), `*.test.ts` (TypeScript)
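
A self-contained sketch of the `respx` mocking style used on the Python side (not copied from the repo's test suite; it shows the interception pattern in isolation):

```python
import httpx
import respx


@respx.mock
def test_mocked_chat_completion():
    # Intercept the OpenAI endpoint that LLM-based scorers ultimately call.
    route = respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={"choices": [{"message": {"content": "stubbed"}}]},
        )
    )

    # Any httpx request made while the mock is active is intercepted,
    # so nothing reaches the real API.
    response = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        json={"model": "gpt-4o", "messages": []},
    )

    assert route.called
    assert response.json()["choices"][0]["message"]["content"] == "stubbed"
```
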
2 changes: 2 additions & 0 deletions README.md
@@ -361,6 +361,8 @@ Eval(
- Numeric difference
- JSON diff

For detailed documentation on all scorers, including parameters, score ranges, and usage examples, see the [**Scorer Reference**](SCORERS.md).

## Custom evaluation prompts

Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism: