Skip to content

rouapps/caret

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Caret

Rust 1.75+ MIT License Platform

Caret is a terminal tool for inspecting and cleaning large LLM training datasets. It handles JSONL, Parquet, and CSV files using memory-mapped I/O, and includes near-duplicate detection, token visualization, dataset linting, and an MCP server.

Quick Start

git clone https://github.com/rouapps/caret.git
cd caret && cargo build --release

caret data.jsonl                          # JSONL (memory-mapped)
caret data.parquet                        # Parquet (Arrow-native)
caret data.csv                            # CSV
caret hf://tatsu-lab/alpaca               # Stream from HuggingFace (no download)
caret data.jsonl --mcp-port 3100          # Start MCP server alongside TUI

How It Works

Caret memory-maps the file via memmap2 and builds a byte-offset index of line boundaries. Line access is O(1) -- the OS page cache handles the rest. Data is never copied into userspace; Caret slices directly into the mapped region.

For remote HuggingFace datasets, Caret fetches only the Parquet footer metadata via HTTP Range requests, then loads row-groups on demand as you scroll.

Features

  • Memory-mapped I/O -- files of any size open instantly with near-zero RSS
  • Near-duplicate detection -- SimHash fingerprinting with hardware POPCNT, parallelized via rayon
  • HuggingFace Hub streaming -- browse remote datasets without downloading them
  • MCP server -- expose dataset tools to Claude Desktop, Cursor, or any MCP client
  • Token X-Ray -- visualize tokenization boundaries (Tiktoken, HuggingFace, GPT-2)
  • Dataset linter -- catch unbalanced <think> tags, invalid JSON, missing keys
  • Auto-fix -- repair common formatting issues in JSONL datasets
  • Detail panel -- split-screen pretty-printed JSON view
  • Pipeline support -- reads from stdin (cat data.jsonl | caret -)

MCP Server

Caret implements the Model Context Protocol, exposing these tools:

Tool Description
search_dataset Regex search across the dataset
dataset_info Line count, file size, format metadata
get_lines Random access to any line range
dedup_scan SimHash dedup with statistics
jump_to_line Navigate TUI to a specific line
toggle_view Cycle view mode (Text / Token X-Ray / Tree)
show_detail Show/hide detail panel
caret data.jsonl --mcp-port 3100          # TUI + MCP server
caret data.jsonl --mcp-only               # Headless (for CI/pipelines)

The TUI control tools (jump_to_line, toggle_view, show_detail) allow AI assistants to interactively navigate the dataset while you watch — ask Claude or Gemini to "jump to line 500 and show the tokens" and the TUI responds instantly.

To use with Claude Desktop, add to claude_desktop_config.json:

{
  "mcpServers": {
    "caret": {
      "command": "/path/to/caret",
      "args": ["your_dataset.jsonl", "--mcp-only", "--mcp-port", "3100"]
    }
  }
}

HuggingFace Hub Streaming

caret hf://allenai/c4                     # Default split
caret hf://tatsu-lab/alpaca/train         # Specific split
caret hf://allenai/c4/en/validation       # Config + split

Caret issues a HEAD request to get the file size, fetches the 4-byte Parquet footer length, reads the Thrift metadata (a few KB), then loads only the row-groups you scroll to. The first page appears in under a second.

Deduplication

caret data.jsonl --dedup                              # Scan and report
caret data.jsonl --dedup --dedup-export clean.jsonl   # Export unique lines
caret data.jsonl --dedup --dedup-strategy exact        # Exact match only
caret data.jsonl --dedup --dedup-threshold 5           # Aggressive fuzzy (0-10, default 3)

Press D in the TUI to run an interactive dedup scan. Duplicates are highlighted with a DUP badge.

The engine works in two phases: parallel fingerprinting (workers read directly from the mmap, hash content into 64-bit SimHash via FNV-1a shingles) followed by index construction (each fingerprint compared via XOR + POPCNT). Duplicates are tracked in a compact bitmask -- 1 billion lines takes ~125 MB.

Keyboard Shortcuts

Key Action
j / Down Move down
k / Up Move up
g / G Top / Bottom
Ctrl+d / Ctrl+u Page down / up
Tab Cycle view: Text / Token X-Ray / Tree
Enter Toggle detail panel
D Toggle dedup scan
? Help
q Quit

Usage Reference

# Local files
caret data.jsonl                                      # Auto-detect format
caret data.parquet                                    # Parquet
caret data.csv                                        # CSV
caret data.txt --format jsonl                         # Force format

# HuggingFace
caret hf://org/dataset                                # Default split (train)
caret hf://org/dataset/validation                     # Specific split
caret hf://org/dataset/config/split                   # Config + split

# MCP server
caret data.jsonl --mcp-port 3100                      # TUI + MCP
caret data.jsonl --mcp-only                           # Headless
caret data.jsonl --mcp-only --mcp-port 8080           # Custom port

# Deduplication
caret data.jsonl --dedup                              # Scan and report
caret data.jsonl --dedup --dedup-export clean.jsonl   # Export unique lines
caret data.jsonl --dedup --dedup-strategy exact        # Exact match
caret data.jsonl --dedup --dedup-threshold 5           # Aggressive fuzzy

# Linting
caret data.jsonl --lint
caret data.jsonl --lint --required-keys "messages,prompt"

# Token visualization
caret data.jsonl -t                                   # Tiktoken cl100k_base
caret data.jsonl -t --tiktoken-encoding p50k_base     # Codex encoding
caret data.jsonl -t --tokenizer-type huggingface      # Llama 3.1
caret data.jsonl --tokenizer-path ./my-tokenizer.json # Custom tokenizer

# Auto-fix
caret data.jsonl --fix                                # Creates data_fixed.jsonl
caret data.jsonl --fix -o output.jsonl                # Custom output
caret data.jsonl --fix --fix-in-place                 # Overwrite original

# Pipeline
cat data.jsonl | caret -

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        Caret TUI (Ratatui)                       │
├──────────┬──────────┬──────────────┬──────────┬─────────────────┤
│ Dataset  │Tokenizer │   Linter     │  Dedup   │   MCP Server    │
│  (mmap)  │(Tiktoken)│(Regex+JSON)  │(SimHash) │  (Axum/Tokio)   │
├──────────┼──────────┴──────────────┴──────────┤─────────────────┤
│ HF Stream│     memmap2 · serde_json · rayon    │ reqwest · axum  │
│ (Range)  │                                     │ tower · tracing │
└──────────┴─────────────────────────────────────┴─────────────────┘

Contributing

Contributions welcome. See issues labeled good first issue.

cargo run -- test_data.jsonl              # Development
cargo test                                # Tests
cargo build --release                     # Optimized build
RUST_LOG=caret=debug cargo run -- test_data.jsonl  # Debug logging

Requirements

  • Rust 1.75+
  • A terminal with 256-color support

License

MIT -- see LICENSE.

About

Terminal tool for inspecting and cleaning large LLM training datasets. Handles JSONL, Parquet, and CSV with memory-mapped I/O, near-duplicate detection, token visualization, dataset linting, and an MCP server.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages