Current Version: 0.3.0 🎉
CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines with token-accurate chunking support.
Available on PyPI: https://pypi.org/project/rag-chunk/
- 📄 Parse and clean Markdown files
- ✂️ Multiple chunking strategies:
  - `fixed-size`: Split by fixed word/token count
  - `sliding-window`: Overlapping chunks for context preservation
  - `paragraph`: Natural paragraph boundaries
- 🎯 Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
- 🎨 Model selection via `--tiktoken-model` flag
- 📊 Recall-based evaluation with test JSON files
- 🌈 Beautiful CLI output with Rich tables
- 📈 Compare all strategies with `--strategy all`
- 💾 Export results as JSON or CSV
rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.
- Core CLI engine (`argparse`)
- Markdown (`.md`) file parsing
- Basic chunking strategies: `fixed-size`, `sliding-window`, and `paragraph` (word-based)
- Evaluation harness: calculate Recall score from a `test-file.json`
- Beautiful CLI output (`rich` tables)
- Published on PyPI: `pip install rag-chunk`
- Tiktoken Support: Added `--use-tiktoken` flag for precise token-based chunking
- Model Selection: Added `--tiktoken-model` to choose the tokenization model (default: `gpt-3.5-turbo`)
- Improved Documentation: Updated README with tiktoken usage examples and comparisons
- Enhanced Testing: Added comprehensive unit tests for token-based chunking
- Optional Dependencies: tiktoken available via `pip install rag-chunk[tiktoken]`
- Recursive Character Splitting: Add LangChain's `RecursiveCharacterTextSplitter` for semantic chunking
  - Install with: `pip install rag-chunk[langchain]`
  - Strategy: `--strategy recursive-character`
  - Works with both word-based and tiktoken modes
- More File Formats: Support `.txt` files
- Additional Metrics: Add precision, F1-score, and chunk quality metrics
- Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting
- Export Connectors: Direct integration with vector stores (Pinecone, Weaviate, Chroma)
- Benchmarking Mode: Automated strategy comparison with recommendations
- MLFlow Integration: Track experiments and chunking configurations
- Performance Optimization: Parallel processing for large document sets
```bash
pip install rag-chunk
```
## Features
- Parse and clean Markdown files in a folder
- Chunk text using fixed-size, sliding-window, or paragraph-based strategies
- Evaluate chunk recall based on a provided test JSON file
- Output results as table, JSON, or CSV
- Store generated chunks temporarily in `.chunks`
## Installation
```bash
pip install rag-chunk
```

For token-based chunking with tiktoken support:

```bash
pip install rag-chunk[tiktoken]
```

Or install from source:

```bash
pip install .
```

Development mode:

```bash
pip install -e .
pip install -e .[tiktoken]  # with tiktoken support
```

## Usage

```bash
rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --output table
```

General syntax:

```bash
rag-chunk analyze <folder> [options]
```

| Option | Description | Default |
|---|---|---|
| `--strategy` | Chunking strategy: `fixed-size`, `sliding-window`, `paragraph`, or `all` | `fixed-size` |
| `--chunk-size` | Number of words or tokens per chunk | 200 |
| `--overlap` | Number of overlapping words or tokens (for sliding-window) | 50 |
| `--use-tiktoken` | Use tiktoken for precise token-based chunking (requires `pip install rag-chunk[tiktoken]`) | False |
| `--test-file` | Path to JSON test file with questions | None |
| `--top-k` | Number of chunks to retrieve per question | 3 |
| `--output` | Output format: `table`, `json`, or `csv` | `table` |
If `--strategy all` is chosen, every strategy is run with the supplied chunk size and overlap where applicable.
Analyze markdown files and generate chunks without evaluation:
```bash
rag-chunk analyze examples/ --strategy paragraph
```

Output:

```
strategy  | chunks | avg_recall | saved
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145

Total text length (chars): 3542
```
Run all chunking strategies with custom parameters:
```bash
rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table
```

Output:

```
strategy       | chunks | avg_recall | saved
---------------+--------+------------+---------------------------------------
fixed-size     | 36     | 0.0        | .chunks/fixed-size-20251115-020156
sliding-window | 45     | 0.0        | .chunks/sliding-window-20251115-020156
paragraph      | 12     | 0.0        | .chunks/paragraph-20251115-020156

Total text length (chars): 3542
```
Measure recall using a test file with questions and relevant phrases:
```bash
rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --output table
```

Output:

```
strategy       | chunks | avg_recall | saved
---------------+--------+------------+---------------------------------------
fixed-size     | 24     | 0.7812     | .chunks/fixed-size-20251115-020203
sliding-window | 32     | 0.8542     | .chunks/sliding-window-20251115-020203
paragraph      | 12     | 0.9167     | .chunks/paragraph-20251115-020203
```
Paragraph-based chunking achieves highest recall (91.67%) because it preserves semantic boundaries in well-structured documents.
```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json
```

Output structure:

```json
{
"results": [
{
"strategy": "sliding-window",
"chunks": 38,
"avg_recall": 0.8958,
"saved": ".chunks/sliding-window-20251115-020210"
}
],
"detail": {
"sliding-window": [
{
"question": "What are the three main stages of a RAG pipeline?",
"recall": 1.0
},
{
"question": "What is the main advantage of RAG over pure generative models?",
"recall": 0.6667
}
]
}
}
```

Export to CSV:

```bash
rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv
```

Creates `analysis_results.csv` with columns: strategy, chunks, avg_recall, saved.
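The CSV mirrors the table output. The rows below are illustrative (reusing the numbers from the recall example above); the exact formatting rag-chunk writes is assumed here:

```csv
strategy,chunks,avg_recall,saved
fixed-size,24,0.7812,.chunks/fixed-size-20251115-020203
sliding-window,32,0.8542,.chunks/sliding-window-20251115-020203
paragraph,12,0.9167,.chunks/paragraph-20251115-020203
```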
By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the --use-tiktoken flag.
```bash
pip install rag-chunk[tiktoken]
```

Token-based fixed-size chunking:

```bash
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table
```

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.
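The tokenizer defaults to `gpt-3.5-turbo`. Per the feature list, `--tiktoken-model` selects a different model's encoding; a plausible invocation (the model name here is an example value, and accepted names depend on your installed tiktoken version) would be:

```bash
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --tiktoken-model gpt-4 --output table
```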
Compare word-based vs token-based chunking:
```bash
# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json
```

Token-based with sliding window:

```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3
```

- ✅ Use tiktoken when:
  - Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
  - You need to respect strict token limits (e.g., 8k, 16k context windows)
  - Comparing chunking strategies with token-accurate measurements
  - Your documents contain special characters, emojis, or non-ASCII text
- ⚠️ Use word-based (default) when:
  - Quick prototyping and testing
  - Working with well-formatted English text
  - You don't need exact token counts
  - You want to avoid the tiktoken dependency
You can also use tiktoken in your own scripts:
```python
from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")
```

Test file format: a JSON file with a `questions` array (or a direct array at the top level):

```json
{
"questions": [
{
"question": "What are the three main stages of a RAG pipeline?",
"relevant": ["indexing", "retrieval", "generation"]
},
{
"question": "What is the main advantage of RAG over pure generative models?",
"relevant": ["grounding", "retrieved documents", "hallucinate"]
}
]
}
```

- `question`: The query text used for chunk retrieval
- `relevant`: List of phrases/terms that should appear in relevant chunks
Recall calculation: For each question, the tool retrieves top-k chunks using lexical similarity and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is computed across all questions.
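As a rough sketch of that arithmetic (a standalone illustration, not rag-chunk's internal code):

```python
def phrase_recall(retrieved_chunks: list[str], relevant_phrases: list[str]) -> float:
    """Recall = relevant phrases found in the retrieved chunks / total relevant phrases."""
    combined = " ".join(retrieved_chunks).lower()
    found = sum(1 for phrase in relevant_phrases if phrase.lower() in combined)
    return found / len(relevant_phrases) if relevant_phrases else 0.0

# 2 of 3 phrases appear in the retrieved text -> recall ≈ 0.67
print(phrase_recall(
    ["Grounding the generator in retrieved documents reduces hallucination."],
    ["grounding", "retrieved documents", "hallucinate"],
))
```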
**chunks**: Number of chunks created by the strategy. More chunks = finer granularity but higher indexing cost.

**avg_recall**: Fraction of relevant phrases successfully retrieved in the top-k chunks, from 0.0 to 1.0. Higher is better.
Interpreting recall:
- > 0.85: Excellent - strategy preserves most relevant information
- 0.70 - 0.85: Good - acceptable for most use cases
- 0.50 - 0.70: Fair - consider adjusting chunk size or strategy
- < 0.50: Poor - important information being lost or fragmented
**saved**: Directory where chunks are written as individual `.txt` files for inspection.
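To inspect chunks from a run, standard shell commands are enough (the timestamped directory name below comes from the example output above and will differ for your runs):

```bash
ls .chunks/paragraph-20251115-020145/
cat .chunks/paragraph-20251115-020145/*.txt | head
```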
| Strategy | Best For | Chunk Size Recommendation |
|---|---|---|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |
General guidelines:
- Start with paragraph for markdown with clear structure
- Use sliding-window if paragraphs are too long (>300 words)
- Use fixed-size as baseline for comparison
- Always test with representative questions from your domain
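For example, a comparison run that follows these guidelines, using only the flags documented above (`docs/` and `questions.json` are placeholder paths for your own corpus and test file):

```bash
# Run all strategies in one pass, with ~25% overlap for sliding-window
rag-chunk analyze docs/ --strategy all --chunk-size 200 --overlap 50 --test-file questions.json --top-k 3 --output table
```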
Add a new chunking strategy:
- Implement a function in `src/chunker.py`:

```python
from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your logic here
    chunks.append({"id": 0, "text": "chunk text"})
    return chunks
```

- Register it in `STRATEGIES`:

```python
STRATEGIES = {
"custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
...
}
```

- Use it via the CLI:

```bash
rag-chunk analyze docs/ --strategy custom --chunk-size 180
```

Project structure:

```
rag-chunk/
├── src/
│ ├── __init__.py
│ ├── parser.py # Markdown parsing and cleaning
│ ├── chunker.py # Chunking strategies
│ ├── scorer.py # Retrieval and recall evaluation
│ └── cli.py # Command-line interface
├── tests/
│ └── test_basic.py # Unit tests
├── examples/
│ ├── rag_introduction.md
│ ├── chunking_strategies.md
│ ├── evaluation_metrics.md
│ └── questions.json
├── .chunks/ # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore
```
License: MIT
By default, --chunk-size and --overlap count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.
For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the --use-tiktoken flag after installing the optional dependency:
```bash
pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken
```

See the Using Tiktoken section for more details.
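To see why subword token counts diverge from whitespace word counts, you can call tiktoken directly; a small illustrative script (exact counts depend on the encoding):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Internationalization of retrieval-augmented generation pipelines"

words = len(text.split())        # whitespace word count
tokens = len(enc.encode(text))   # subword token count
print(f"Words: {words}, Tokens: {tokens}")  # long words split into several tokens
```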
