Smart Chunker
Description
People struggle to get data into Chroma, especially for the first time. There is currently no way to get a text document into Chroma without thinking about chunking separately and leaving our product surface.
Additionally, it’s very easy to get poor accuracy at the chunking stage by creating chunks that are too big to fit into the embedding model’s context window; documents get silently truncated and information is lost.
Chroma collections should ship a default chunker that you can just throw text documents at, and it will do the right thing. It’s up to you to get your documents into text, but we will chunk them.
API Design
# Getting a chunker
chunker = chromadb.utils.chunkers.Chunker()

# API 1.
# Avoids having to store a chunker on a collection; you can re-use your chunker.
collection.add(..., chunker=chunker)

# API 2.
collection.chunk_add(..., chunker=chunker)

# API 3.
# The user assigns IDs, then adds the chunker's results.
chunker_results = chunker.chunk(ids, docs, token_limit)  # -> Results
collection.add(**chunker_results)

# API 4.
chunker.add(collection, docs)
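
For illustration, if API 1 were chosen, a user-facing flow might look roughly like the sketch below. DefaultChunker, the chromadb.utils.chunkers module, and the chunker= argument are names from this proposal, not the current Chroma API.

import chromadb
from chromadb.utils.chunkers import DefaultChunker  # proposed module, does not exist today

client = chromadb.Client()
collection = client.create_collection(name="docs")

chunker = DefaultChunker()
long_document = open("report.txt").read()

# Under the proposed API 1, the chunker splits the document to fit the
# collection's embedding model, and the resulting chunks are added with
# metadata linking them back to "report-1".
collection.add(ids=["report-1"], documents=[long_document], chunker=chunker)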
# Impl 1.
# Chunker is a protocol class with a similar interface to EFs.
from typing import List, Optional, Protocol, Tuple

from chromadb.api.models.Collection import Collection
from chromadb.api.types import Document, Metadatas

class Chunker(Protocol):
    def __call__(
        self, docs: List[Document], collection: Collection, **kwargs
    ) -> Tuple[List[Document], Optional[Metadatas]]:
        # Find out the token limit of the collection's EF.
        # This can be a variable on Collection assigned at creation, so that
        # Chunkers don't need to know about EFs.
        # Process the docs into chunks.
        # Return the list of chunks for embedding.
        # The chunker might generate some metadata to keep the doc-chunk
        # association etc.
        ...

# Impl 2.
# Chunker is something that gets passed to an embedding function, which knows
# its own token limit. When the EF sees a doc that's too long, it calls the
# chunker. Alternatively, it batches all the over-long docs for one chunking pass.
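
As a concrete sketch of Impl 1, a minimal stateless chunker might look like the following. SimpleChunker and the max_chunk_chars attribute on Collection (the "variable on Collection assigned at creation" above) are illustrative names from this proposal, not existing Chroma API. It splits on paragraph boundaries and falls back to fixed-size windows.

from typing import List, Optional, Tuple

from chromadb.api.models.Collection import Collection
from chromadb.api.types import Document, Metadatas

class SimpleChunker:
    # Stateless chunker matching the Chunker protocol from Impl 1.

    def __call__(
        self, docs: List[Document], collection: Collection, **kwargs
    ) -> Tuple[List[Document], Optional[Metadatas]]:
        # Hypothetical attribute from this proposal; fall back to a default
        # if the collection doesn't carry a limit.
        limit = getattr(collection, "max_chunk_chars", 1000)

        chunks: List[Document] = []
        metadatas: Metadatas = []
        for doc_index, doc in enumerate(docs):
            for chunk_index, piece in enumerate(self._split(doc, limit)):
                chunks.append(piece)
                # Record the doc-chunk association so whole documents can be
                # reconstituted later.
                metadatas.append(
                    {"source_doc_index": doc_index, "chunk_index": chunk_index}
                )
        return chunks, metadatas

    def _split(self, doc: str, limit: int) -> List[str]:
        # Prefer paragraph boundaries; fall back to fixed-size windows for
        # paragraphs that are still too long.
        pieces: List[str] = []
        for para in doc.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if len(para) <= limit:
                pieces.append(para)
            else:
                pieces.extend(
                    para[i : i + limit] for i in range(0, len(para), limit)
                )
        return pieces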
Subtasks
- [LOW] We can adapt the recursive character splitter from LangChain as a sensible default to start with.
- [MED] We can make the chunker aware of the collection’s EF’s context window length and set its parameters accordingly. It can get these on invocation of __call__, so the Chunker itself is completely stateless except for its ‘chunking model’.
- [HIGH] In some use-cases, users want to keep the association between documents and chunks, and retrieve whole documents. We can use metadata fields to do that, but there is some complexity in ID assignment, and in creating a post-processing step to reconstitute documents if we want maximum ease of use (see the sketch after this list).
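
To make the [HIGH] item concrete, one possible shape for ID assignment and document reconstitution, using only the existing collection API, is sketched below. The doc_id#chunk-i ID scheme and the source_doc_id / chunk_index metadata keys are illustrative choices, not part of Chroma today.

from typing import List

from chromadb.api.models.Collection import Collection

def add_document(collection: Collection, doc_id: str, chunks: List[str]) -> None:
    # One deterministic ID per chunk, derived from the source document's ID.
    collection.add(
        ids=[f"{doc_id}#chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[
            {"source_doc_id": doc_id, "chunk_index": i} for i in range(len(chunks))
        ],
    )

def reconstitute(collection: Collection, doc_id: str) -> str:
    # Post-processing step: fetch every chunk belonging to the document and
    # reassemble it in order.
    result = collection.get(
        where={"source_doc_id": doc_id},
        include=["documents", "metadatas"],
    )
    ordered = sorted(
        zip(result["metadatas"], result["documents"]),
        key=lambda pair: pair[0]["chunk_index"],
    )
    return "\n\n".join(doc for _, doc in ordered)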
Misc
- Chunking doesn’t work / make sense for multi-modal data, though we could totally do image patches as ‘chunks’ in future.
- This doesn’t touch the underlying server API at all, and is designed to be completely stateless, i.e. chunkers are just a pipe.