Colab notebook error - Collection [auto_chunk] does not exists #13

@silverfoxf7

To reproduce, I ran your Google Colab notebook, and got stuck on this step:

# Initialize evaluation
evaluation = GeneralEvaluation()

results = []

# Initialize an empty DataFrame
df = pd.DataFrame()

# Display the DataFrame
display_handle = display(df, display_id=True)

for chunker in chunkers:
    result = evaluation.run(chunker, ef, retrieve=5)
    del result['corpora_scores']  # Remove detailed scores for brevity
    chunk_size = chunker._chunk_size if hasattr(chunker, '_chunk_size') else 0
    chunk_overlap = chunker._chunk_overlap if hasattr(chunker, '_chunk_overlap') else 0
    result['chunker'] = chunker.__class__.__name__ + f"_{chunk_size}_{chunk_overlap}"
    results.append(result)

    # Update the DataFrame
    df = pd.DataFrame(results)
    clear_output(wait=True)
    display_handle.update(df)

output:

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-12-60d3982b15a8> in <cell line: 0>()
     11 
     12 for chunker in chunkers:
---> 13     result = evaluation.run(chunker, ef, retrieve=5)
     14     del result['corpora_scores']  # Remove detailed scores for brevity
     15     chunk_size = chunker._chunk_size if hasattr(chunker, '_chunk_size') else 0

3 frames
/usr/local/lib/python3.11/dist-packages/chromadb/api/rust.py in delete_collection(self, name, tenant, database)
    284         database: str = DEFAULT_DATABASE,
    285     ) -> None:
--> 286         self.bindings.delete_collection(name, tenant, database)
    287 
    288     @override

NotFoundError: Collection [auto_chunk] does not exists
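
Note that the traceback shows the exception being raised from delete_collection, not from a create or get call. Below is a minimal sketch of a delete that tolerates a missing collection. delete_collection_if_exists and StubClient are hypothetical names, not part of chromadb or chunking_evaluation; the stub only stands in for a real client so the snippet runs on its own. Matching on the message text is an assumption, made because the exception type raised for a missing collection may differ between chromadb versions.

```python
def delete_collection_if_exists(client, name):
    """Delete `name`; return True if it was deleted, False if it was absent."""
    try:
        client.delete_collection(name)
        return True
    except Exception as exc:
        # Assumption: a missing collection is reported with a message
        # containing "does not exist" (this also matches "does not exists").
        if "does not exist" in str(exc):
            return False
        # Anything else is a real failure and should surface.
        raise


class StubClient:
    """Hypothetical stand-in for a chromadb client, for illustration only."""

    def __init__(self):
        self.collections = set()

    def create_collection(self, name):
        self.collections.add(name)

    def delete_collection(self, name):
        if name not in self.collections:
            raise LookupError(f"Collection [{name}] does not exists")
        self.collections.remove(name)
```

With a real client, the same helper would be called as delete_collection_if_exists(client, "auto_chunk") before creating the collection. Whether that is enough depends on where the library itself calls delete_collection, which this sketch does not change.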

Summary of Troubleshooting for "Collection [auto_chunk] does not exist" with chunking_evaluation

The core problem revolved around the GeneralEvaluation class from the chunking_evaluation library failing to find or create a ChromaDB collection named auto_chunk, which it internally uses during its run method.

Attempts and Outcomes:

  1. Using a Persistent ChromaDB Path:

    • Attempt: Modified scripts/evaluate_chunkers.py to instantiate GeneralEvaluation with a specific file path for chroma_db_path (e.g., PROJECT_ROOT / "eval_chroma_db") instead of letting it default to an in-memory ChromaDB.
    • Hypothesis: An in-memory client might have issues with collection persistence or lifecycle across different calls within the library. A persistent path might provide clearer errors or resolve transient issues.
    • Outcome: Did not work. The "Collection [auto_chunk] does not exist" error persisted even with a file-based ChromaDB.
  2. Explicit Collection Pre-creation:

    • Attempt: Added code to scripts/evaluate_chunkers.py to manually get the ChromaDB client from the GeneralEvaluation instance and explicitly try to delete (if exists) and then create the auto_chunk collection before calling the evaluation.run() method.
    • Hypothesis: The library's internal create_collection call might be failing silently or not executing as expected. Pre-creating it would isolate this.
    • Outcome: Led to a new error: 'GeneralEvaluation' object has no attribute 'client'. This pointed to either a problem with the GeneralEvaluation object's initialization or a wrong attribute name in the debug code.
  3. Debugging client vs. chroma_client Attribute:

    • Investigation: Realized that BaseEvaluation (parent of GeneralEvaluation) initializes self.chroma_client, not self.client. The debug code for pre-creation was looking for the wrong attribute.
    • Correction: Adjusted debug prints to check for self.chroma_client.
    • Hypothesis (for original error): The run method internally might be incorrectly trying to use self.client (which wouldn't exist) instead of self.chroma_client for collection operations.
    • Outcome (for attribute check): Confirmed evaluation_instance.chroma_client existed.
    • Outcome (for original error): The "Collection [auto_chunk] does not exist" error remained. Either the library's run method uses self.client in some places and self.chroma_client in others (not confirmed), or the collection creation was still failing for other reasons.
  4. Allowing evaluation.run() to use its Default Embedding Function:

    • Attempt: Modified scripts/evaluate_chunkers.py to call evaluation.run() without passing our custom default_ef (OpenAI embedding function), letting the library use its internal default.
    • Hypothesis: Our custom embedding function might be incompatible or causing issues with the collection creation or interaction within the library.
    • Outcome: Did not work. The "Collection [auto_chunk] does not exist" error persisted.
  5. Checking and Downgrading chromadb Version:

    • Investigation:
      • Checked chunking_evaluation's setup.py: Found it listed chromadb as a dependency but without a specific version, suggesting potential for version incompatibility.
      • Checked installed chromadb version: Found 1.0.8.
      • Compared chromadb release history with chunking_evaluation commit history: chunking_evaluation was likely developed against an older chromadb version (pre-0.5.x or around 0.4.x based on timeline, while 1.0.8 was much newer).
    • Attempt: Downgraded chromadb from 1.0.8 to 0.4.24, a version likely contemporary with chunking_evaluation's active development. An earlier attempt with 0.6.3 also failed, but with different errors. The final attempt related to auto_chunk used the significantly older 0.4.24.
    • Hypothesis: A chromadb API incompatibility between the version chunking_evaluation expected and the version we were using (1.0.8) was causing create_collection or subsequent operations to fail in a way that resulted in the "collection does not exist" error.
    • Outcome (with chromadb 0.4.24): Changed the behavior but did not fully resolve the problem. The "Collection [auto_chunk] does not exist" error appeared to be bypassed, but new errors emerged, such as APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body' during MyChunkerWrapper's evaluation, along with warnings for ClusterSemanticChunker. So although the collection issue may have been sidestepped, the downgrade exposed other incompatibilities: either chunking_evaluation could not handle the old chromadb client API together with our MyChunkerWrapper, or the library itself has other issues with that version.
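
The version check from step 5 can be sketched with the standard library's importlib.metadata. The helper names below are hypothetical, and the parser is deliberately simple: it only handles plain dotted numeric versions such as 1.0.8 and stops at the first non-numeric component. The 0.4.24 baseline is just the version tried above, not a documented requirement of chunking_evaluation.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '1.0.8' into a tuple of ints: (1, 0, 8)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than(installed, baseline):
    """True if `installed` sorts after `baseline` when compared numerically."""
    return parse_version(installed) > parse_version(baseline)
```

For example, is_newer_than(installed_version("chromadb"), "0.4.24") would flag an environment like the Colab one here, provided chromadb is installed so that installed_version does not return None.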

Conclusion of aforementioned troubleshooting:
The "Collection [auto_chunk] does not exist" error within the chunking_evaluation library proved difficult to resolve directly. While chromadb version incompatibility was a strong suspect and downgrading did change the error landscape, it didn't lead to a clean, working evaluation.
