semantic-chunker

A strongly-typed semantic text chunking library that intelligently splits content while preserving structure and meaning. Built on top of semantic-text-splitter with an enhanced type-safe API.

Features

- 🎯 Multiple tokenization strategies:
    - OpenAI's tiktoken models (e.g., "gpt-3.5-turbo")
    - Hugging Face tokenizers (from objects, JSON strings, or files)
    - Custom tokenization callbacks
- 📝 Three specialized chunking modes:
    - Plain text
    - Markdown (preserves structure)
    - Code (preserves syntax via tree-sitter)
- 🔄 Configurable chunk overlapping
- ✂️ Optional whitespace trimming
- 💪 Full type safety with Protocol types

Installation

Basic installation (text and markdown support):

pip install semantic-chunker

With code chunking support:

pip install semantic-chunker[code]

With Hugging Face tokenizers support:

pip install semantic-chunker[tokenizers]

With all features:

pip install semantic-chunker[all]
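
To see which optional extras are usable in the current environment, a quick check along these lines can help. This is only a sketch: installed_extras is a hypothetical helper (not part of semantic-chunker), and the import names tree_sitter_language_pack and tokenizers are assumptions inferred from the extras listed above.

```python
from importlib.util import find_spec
from typing import Dict

# Assumed import names for the optional dependencies behind each extra.
OPTIONAL_MODULES = {
    "code": "tree_sitter_language_pack",
    "tokenizers": "tokenizers",
}


def installed_extras() -> Dict[str, bool]:
    """Report, per extra, whether its (assumed) backing module is importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_MODULES.items()
    }


if __name__ == "__main__":
    for extra, available in installed_extras().items():
        print(f"{extra}: {'installed' if available else 'missing'}")
```

If an extra shows as missing, installing it with the matching pip command above should make the corresponding feature available.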

Usage

Text Chunking

from semantic_chunker import get_chunker

plain_text = """Contrary to popular belief, Lorem Ipsum is not simply random text. ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="text",  # required
    max_tokens=10,  # required
    trim=False,  # default True
    overlap=5,  # default 0
)

chunks = chunker.chunks(plain_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(plain_text)  # list[tuple[str, int]]
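
When overlap is set, consecutive chunks are expected to share some content at their boundary. The sketch below shows one way to inspect that shared text. shared_overlap is a hypothetical helper, and the example chunks are hand-written stand-ins rather than real chunker output.

```python
def shared_overlap(previous: str, current: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `current`."""
    limit = min(len(previous), len(current))
    # Try the largest candidate first so the first match is the longest one.
    for size in range(limit, 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


# Hand-written chunks standing in for the result of chunker.chunks(...).
example_chunks = ["Lorem Ipsum is not", "is not simply random", "random text."]
overlaps = [
    shared_overlap(left, right)
    for left, right in zip(example_chunks, example_chunks[1:])
]
```

Running the same comparison over the list returned by chunker.chunks(plain_text) gives a rough view of how much text each pair of neighbours shares. Note that the overlap setting is counted in tokens, so the shared text will vary in character length.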

Markdown Chunking

from semantic_chunker import get_chunker

markdown_text = """# Lorem Ipsum Intro ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="markdown",
    max_tokens=10,
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(markdown_text)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(markdown_text)  # list[tuple[str, int]]

Code Chunking

from semantic_chunker import get_chunker

kotlin_snippet = """import kotlin.random.Random ..."""

chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="code",
    max_tokens=10,
    tree_sitter_language="kotlin",  # required for code chunking
    trim=False,
    overlap=5,
)

chunks = chunker.chunks(kotlin_snippet)  # list[str]
chunk_with_indices = chunker.chunk_with_indices(kotlin_snippet)  # list[tuple[str, int]]

Error Handling

from semantic_chunker import get_chunker

# Missing language for code chunking
try:
    chunker = get_chunker("gpt-4", chunking_type="code", max_tokens=10)
except ValueError as e:
    print(e)  # "Language must be provided for code chunking."

# Missing required package for code chunking
try:
    chunker = get_chunker("gpt-4", chunking_type="code", tree_sitter_language="python", max_tokens=10)
except ModuleNotFoundError as e:
    print(e)  # "tree-sitter-language-pack is required for 'code' style chunking..."
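
One possible way to act on the ModuleNotFoundError above is to fall back to plain-text chunking when the code extra is not installed. The sketch below is an assumption about how a caller might do that, not behaviour provided by the library. code_chunker_or_text_fallback is a hypothetical helper, and the factory is passed in as a parameter so the logic can be exercised without the package installed.

```python
from typing import Any, Callable


def code_chunker_or_text_fallback(
    factory: Callable[..., Any],
    tokenizer: Any,
    language: str,
    max_tokens: int,
) -> Any:
    """Build a code chunker, or a text chunker if the code extra is missing."""
    try:
        return factory(
            tokenizer,
            chunking_type="code",
            tree_sitter_language=language,
            max_tokens=max_tokens,
        )
    except ModuleNotFoundError:
        # The optional code dependency is not installed; use plain-text chunking.
        return factory(tokenizer, chunking_type="text", max_tokens=max_tokens)
```

In real use, get_chunker would be passed as the factory, for example code_chunker_or_text_fallback(get_chunker, "gpt-4", "python", 10).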

Chunking Type and Tokenization Options

get_chunker requires the first argument to be one of:
1. A tiktoken model name string (e.g., gpt-4o)
2. A function that takes a string and returns a token count (integer)
3. A tokenizers.Tokenizer instance
4. A string path to a tokenizer JSON file in the tokenizers format

Required kwargs:
- chunking_type: One of text, markdown, or code.
- max_tokens: Maximum tokens per chunk. Accepts an integer or a tuple (min, max).
- tree_sitter_language: Required only when chunking_type is code.
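
As an illustration of the callback option, the sketch below defines a simple token counter and passes it as the first argument. count_word_tokens and build_chunker are hypothetical names, and the regex-based counting rule is an arbitrary choice for demonstration, not a tokenizer the library ships. The import is kept inside build_chunker so the counter can be tried without the package installed.

```python
import re


def count_word_tokens(text: str) -> int:
    """Count words and standalone punctuation marks as tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def build_chunker():
    """Create a text chunker driven by the custom counter above."""
    # Imported lazily; requires semantic-chunker to be installed.
    from semantic_chunker import get_chunker

    return get_chunker(
        count_word_tokens,      # option 2: a callable returning a token count
        chunking_type="text",
        max_tokens=(50, 100),   # tuple form (min, max) described above
    )
```

Any callable with the same shape (string in, integer out) should work the same way, which makes it possible to plug in a tokenizer the library does not support directly.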

Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

Local Development

1. Clone the repo.
2. Install the system dependencies.
3. Install the full dependencies with uv sync.
4. Install the pre-commit hooks with:

pre-commit install && pre-commit install --hook-type commit-msg

5. Make your changes and submit a PR.

License

This library uses the MIT license.
