[New Feature][Ease of Use] Smart Chunker #2281

Open
@atroyn

Smart Chunker

People struggle to get data into Chroma, especially for the first time. There is currently no way to get a text document into Chroma without handling chunking separately, which means leaving our product surface.

Additionally, it’s very easy to lose accuracy at the chunking stage by creating chunks that are too big to fit into the embedding model’s context window: documents get silently truncated and information is lost.
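To make the failure mode concrete, here is a toy illustration. It is only a sketch: whitespace-separated words stand in for tokens, and `embed_input` and `TOKEN_LIMIT` are hypothetical names, not part of Chroma or any embedding model's API.

```python
TOKEN_LIMIT = 8  # hypothetical context window, measured in "tokens"


def embed_input(text: str, limit: int = TOKEN_LIMIT) -> list:
    """Mimic a model that silently keeps only the first `limit` tokens."""
    tokens = text.split()  # assumption: whitespace words stand in for tokens
    return tokens[:limit]


doc = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
seen = embed_input(doc)          # the part that would actually be embedded
lost = doc.split()[len(seen):]   # the part that is dropped without any error

print(seen)
print(lost)  # ['iota', 'kappa']
```

No exception is raised anywhere, which is why the loss tends to go unnoticed until retrieval quality suffers.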

Chroma collections should ship a default chunker that you can just throw text documents at, and it will do the right thing. It’s up to you to get your documents into text, but we will chunk them.

API Design

# Getting a chunker
chunker = chromadb.utils.chunkers.Chunker()

# API 1.
# Avoids having to store a chunker on a collection; you can re-use your chunker.
collection.add(..., chunker=chunker)

# API 2.
collection.chunk_add(..., chunker=chunker)

# API 3.
chunker_results = chunker.chunk(ids, docs, token_limit)  # returns Results
# ... user assigns IDs
collection.add(**chunker_results)

# API 4.
chunker.add(collection, docs)

# Impl 1.
# Chunker is a protocol class with a similar interface to EFs (embedding functions).
class Chunker(Protocol):
    ...

    def __call__(self, docs: List[Doc], collection: Collection, ...) -> Tuple[List[Doc], Optional[Metadatas]]:
        # Find out the token limit on the collection's EF.
        # This can be a variable on Collection assigned at creation, so that Chunkers don't need to know about EFs.
        # Process chunks.
        # Return the list of chunks for embedding.
        # The chunker might generate some metadata to keep the doc-chunk association etc.
        ...

# Impl 2.
# Chunker is something that gets passed to an embedding function, which knows its own token limit.
# When the embedding function sees a doc that's too long, it calls the chunker.
# Alternatively, it batches all such docs for one chunking pass.
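A runnable sketch of how Impl 1 and API 3 could fit together is below. Everything in it is an assumption made for illustration: the protocol takes a plain `token_limit` instead of a `Collection`, `FixedWindowChunker` and `chunk_for_add` are hypothetical names, and whitespace words stand in for tokens. The only thing taken from the existing client is that `collection.add` accepts `ids`, `documents`, and `metadatas` keyword arguments.

```python
from typing import Dict, List, Optional, Protocol, Tuple

Doc = str
Metadata = Dict[str, int]


class Chunker(Protocol):
    """Hypothetical protocol: a stateless callable from documents to chunks."""

    def __call__(
        self, docs: List[Doc], token_limit: int
    ) -> Tuple[List[Doc], Optional[List[Metadata]]]:
        ...


class FixedWindowChunker:
    """Toy chunker: windows of at most `token_limit` whitespace-separated words."""

    def __call__(self, docs, token_limit):
        chunks: List[Doc] = []
        metadatas: List[Metadata] = []
        for doc_index, doc in enumerate(docs):
            words = doc.split()
            for start in range(0, len(words), token_limit):
                chunks.append(" ".join(words[start:start + token_limit]))
                # Record which document and which position each chunk came from.
                metadatas.append(
                    {"doc_index": doc_index, "chunk_index": start // token_limit}
                )
        return chunks, metadatas


def chunk_for_add(
    chunker: Chunker, ids: List[str], docs: List[Doc], token_limit: int
) -> dict:
    """API 3 style: build keyword arguments suitable for collection.add(**result)."""
    chunks, metadatas = chunker(docs, token_limit)
    if metadatas is None:
        raise ValueError("this sketch needs doc/chunk metadata to derive chunk IDs")
    chunk_ids = [f"{ids[m['doc_index']]}-{m['chunk_index']}" for m in metadatas]
    return {"ids": chunk_ids, "documents": chunks, "metadatas": metadatas}


result = chunk_for_add(
    FixedWindowChunker(), ["a", "b"], ["one two three", "four"], token_limit=2
)
print(result["ids"])        # ['a-0', 'a-1', 'b-0']
print(result["documents"])  # ['one two', 'three', 'four']
```

Because the chunker keeps no state between calls, the same instance can be reused across collections, which is the property API 1 and API 3 rely on.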

[Complexity] Subtask

  • [LOW] We can adapt the recursive character splitter from LangChain as a sensible default to start with.

  • [MED] We can make the chunker aware of the context window length of the collection’s EF and set its parameters accordingly. It can receive this on invocation of __call__, so the Chunker itself stays completely stateless except for its ‘chunking model’.

  • [HIGH] In some use cases, users want to keep the association between documents and chunks, and retrieve whole documents. We can use metadata fields to do that, but there is some complexity in ID assignment and in creating a post-processing step that reconstitutes documents, if we want maximum ease of use.
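The [LOW] and [HIGH] subtasks can be sketched together as follows. This is a simplified recursive splitter written from scratch for illustration, not LangChain's implementation: it measures length in characters rather than tokens, has no overlap, and drops separators at chunk boundaries. The `source_id` and `chunk_index` metadata keys, the `doc_id:index` ID scheme, and all function names are hypothetical.

```python
from typing import Dict, List

SEPARATORS = ["\n\n", "\n", " ", ""]  # coarse to fine; "" means a hard character split


def recursive_split(text: str, max_len: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split `text` into pieces of at most `max_len` characters.

    Tries the coarsest separator first and recurses with finer separators
    only for parts that are still too long. Assumes the list ends with "".
    """
    if len(text) <= max_len:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    pieces: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_len:
            current = part
        else:
            pieces.extend(recursive_split(part, max_len, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces


def chunk_documents(ids: List[str], docs: List[str], max_len: int) -> Dict[str, list]:
    """Chunk each document and tag every chunk with its source document."""
    out: Dict[str, list] = {"ids": [], "documents": [], "metadatas": []}
    for doc_id, doc in zip(ids, docs):
        for i, chunk in enumerate(recursive_split(doc, max_len)):
            out["ids"].append(f"{doc_id}:{i}")
            out["documents"].append(chunk)
            out["metadatas"].append({"source_id": doc_id, "chunk_index": i})
    return out


def chunks_by_source(documents: List[str], metadatas: List[dict]) -> Dict[str, List[str]]:
    """Post-processing step: regroup chunks per source document, in chunk order."""
    grouped: Dict[str, list] = {}
    for chunk, meta in zip(documents, metadatas):
        grouped.setdefault(meta["source_id"], []).append((meta["chunk_index"], chunk))
    return {src: [c for _, c in sorted(items)] for src, items in grouped.items()}


result = chunk_documents(["d1"], ["aaa bbb ccc\n\nddd eee"], max_len=7)
print(result["documents"])  # ['aaa bbb', 'ccc', 'ddd eee']
print(chunks_by_source(result["documents"], result["metadatas"]))
```

Regrouping recovers the chunks in order but not the exact original text, since separators are discarded at boundaries. Exact reconstitution would need extra information such as character offsets, which is part of the complexity the [HIGH] item refers to.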

Misc

  • Chunking as described here doesn’t really apply to multi-modal data, though we could use image patches as ‘chunks’ in future.
  • This doesn’t touch the underlying server API at all, and is designed to be completely stateless, i.e. chunkers are just a pipe.

Labels

  • Local Chroma: An improvement to Local (single node) Chroma
  • by-chroma
  • in-progress: Currently working on this
