Colab notebook error - Collection [auto_chunk] does not exists

To reproduce, I ran your Google Colab notebook, and got stuck on this step: 

```
# Initialize evaluation
evaluation = GeneralEvaluation()

results = []

# Initialize an empty DataFrame
df = pd.DataFrame()

# Display the DataFrame
display_handle = display(df, display_id=True)

for chunker in chunkers:
    result = evaluation.run(chunker, ef, retrieve=5)
    del result['corpora_scores']  # Remove detailed scores for brevity
    chunk_size = chunker._chunk_size if hasattr(chunker, '_chunk_size') else 0
    chunk_overlap = chunker._chunk_overlap if hasattr(chunker, '_chunk_overlap') else 0
    result['chunker'] = chunker.__class__.__name__ + f"_{chunk_size}_{chunk_overlap}"
    results.append(result)

    # Update the DataFrame
    df = pd.DataFrame(results)
    clear_output(wait=True)
    display_handle.update(df)
```

Output:

```
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-12-60d3982b15a8> in <cell line: 0>()
     11 
     12 for chunker in chunkers:
---> 13     result = evaluation.run(chunker, ef, retrieve=5)
     14     del result['corpora_scores']  # Remove detailed scores for brevity
     15     chunk_size = chunker._chunk_size if hasattr(chunker, '_chunk_size') else 0

3 frames
/usr/local/lib/python3.11/dist-packages/chromadb/api/rust.py in delete_collection(self, name, tenant, database)
    284         database: str = DEFAULT_DATABASE,
    285     ) -> None:
--> 286         self.bindings.delete_collection(name, tenant, database)
    287 
    288     @override

NotFoundError: Collection [auto_chunk] does not exists
```

**Summary of Troubleshooting for "Collection `[auto_chunk]` does not exist" with `chunking_evaluation`**

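The core problem is that the `GeneralEvaluation` class from the `chunking_evaluation` library fails inside its `run` method because ChromaDB cannot find a collection named `auto_chunk`, which the library uses internally. Note that in the Colab traceback above, the `NotFoundError` is raised from `delete_collection`, so the failure happens while the library tries to delete the collection, not while it queries it.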

**Attempts and Outcomes:**

1.  **Using a Persistent ChromaDB Path:**
    *   **Attempt:** Modified `scripts/evaluate_chunkers.py` to instantiate `GeneralEvaluation` with a specific file path for `chroma_db_path` (e.g., `PROJECT_ROOT / "eval_chroma_db"`) instead of letting it default to an in-memory ChromaDB.
    *   **Hypothesis:** An in-memory client might have issues with collection persistence or lifecycle across different calls within the library. A persistent path might provide clearer errors or resolve transient issues.
    *   **Outcome:** **Did not work.** The "Collection `[auto_chunk]` does not exist" error persisted even with a file-based ChromaDB.

2.  **Explicit Collection Pre-creation:**
    *   **Attempt:** Added code to `scripts/evaluate_chunkers.py` to manually get the ChromaDB client from the `GeneralEvaluation` instance and explicitly try to delete (if exists) and then create the `auto_chunk` collection *before* calling the `evaluation.run()` method.
    *   **Hypothesis:** The library's internal `create_collection` call might be failing silently or not executing as expected. Pre-creating it would isolate this.
    *   **Outcome:** **Led to a new error:** `'GeneralEvaluation' object has no attribute 'client'`. This revealed a deeper issue with the `GeneralEvaluation` object's initialization or the attribute name itself.

3.  **Debugging `client` vs. `chroma_client` Attribute:**
    *   **Investigation:** Realized that `BaseEvaluation` (parent of `GeneralEvaluation`) initializes `self.chroma_client`, not `self.client`. The debug code for pre-creation was looking for the wrong attribute.
    *   **Correction:** Adjusted debug prints to check for `self.chroma_client`.
    *   **Hypothesis (for original error):** The `run` method internally might be incorrectly trying to use `self.client` (which wouldn't exist) instead of `self.chroma_client` for collection operations.
    *   **Outcome (for attribute check):** Confirmed `evaluation_instance.chroma_client` existed.
    *   **Outcome (for original error):** The "Collection `[auto_chunk]` does not exist" error remained. The library's `run` method indeed seemed to use `self.client` in some places and `self.chroma_client` in others, or the collection creation was still failing for other reasons.

4.  **Allowing `evaluation.run()` to use its Default Embedding Function:**
    *   **Attempt:** Modified `scripts/evaluate_chunkers.py` to call `evaluation.run()` *without* passing our custom `default_ef` (OpenAI embedding function), letting the library use its internal default.
    *   **Hypothesis:** Our custom embedding function might be incompatible or causing issues with the collection creation or interaction within the library.
    *   **Outcome:** **Did not work.** The "Collection `[auto_chunk]` does not exist" error persisted.

5.  **Checking and Downgrading `chromadb` Version:**
    *   **Investigation:**
        *   Checked `chunking_evaluation`'s `setup.py`: Found it listed `chromadb` as a dependency but without a specific version, suggesting potential for version incompatibility.
        *   Checked installed `chromadb` version: Found `1.0.8`.
        *   Compared `chromadb` release history with `chunking_evaluation` commit history: `chunking_evaluation` was likely developed against an older `chromadb` version (pre-`0.5.x` or around `0.4.x` based on timeline, while `1.0.8` was much newer).
    *   **Attempt:** Downgraded `chromadb` from `1.0.8` to `0.4.24`, a version likely current when `chunking_evaluation` was under active development. An earlier attempt with `0.6.3` had also failed, though with different errors. The final attempt related to `auto_chunk` used the significantly older `0.4.24`.
    *   **Hypothesis:** A `chromadb` API incompatibility between the version `chunking_evaluation` expected and the version we were using (`1.0.8`) was causing `create_collection` or subsequent operations to fail in a way that resulted in the "collection does not exist" error.
    *   **Outcome (with `chromadb 0.4.24`):** **Changed the behavior but did not resolve the problem.** The "Collection `[auto_chunk]` does not exist" error *seemed* to be bypassed, but new errors emerged: `APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'` when evaluating `MyChunkerWrapper`, plus warnings for `ClusterSemanticChunker`. So the downgrade may have sidestepped the collection issue, but it exposed other incompatibilities: either `chunking_evaluation` could not handle our `MyChunkerWrapper` against that older `chromadb` client API, or the library itself has other issues with that version.
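
For reference, the pre-creation step from attempts 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the library's code: `FakeClient` and `FakeEvaluation` are hypothetical stand-ins so the snippet runs without `chromadb` installed, and `find_chroma_client` and `reset_collection` are helper names made up here. The only calls assumed on the real client are `delete_collection(name)` and `create_collection(name)`. Matching on the text "does not exist" is also an assumption, based on the message in the traceback above.

```python
class FakeNotFoundError(Exception):
    """Hypothetical stand-in for the NotFoundError seen in the traceback."""


class FakeClient:
    """Tiny in-memory stand-in for a ChromaDB client (illustration only)."""

    def __init__(self):
        self.collections = {}

    def delete_collection(self, name):
        # Mirror the reported behavior: deleting a missing collection raises.
        if name not in self.collections:
            raise FakeNotFoundError(f"Collection [{name}] does not exists")
        del self.collections[name]

    def create_collection(self, name):
        self.collections[name] = []
        return self.collections[name]


class FakeEvaluation:
    """Stand-in for GeneralEvaluation; exposes `chroma_client` as in attempt 3."""

    def __init__(self):
        self.chroma_client = FakeClient()


def find_chroma_client(evaluation, candidates=("chroma_client", "client")):
    """Return (attribute_name, client) for the first candidate attribute that is set."""
    for attr in candidates:
        client = getattr(evaluation, attr, None)
        if client is not None:
            return attr, client
    raise AttributeError(f"none of {candidates} found on {type(evaluation).__name__}")


def reset_collection(client, name):
    """Delete the collection if present, tolerate 'not found', then create it."""
    try:
        client.delete_collection(name)
    except Exception as exc:
        # Assumption: "not found" errors carry this text; re-raise anything else.
        if "does not exist" not in str(exc):
            raise
    return client.create_collection(name)


evaluation = FakeEvaluation()
attr, client = find_chroma_client(evaluation)
reset_collection(client, "auto_chunk")  # collection missing: delete error is tolerated
reset_collection(client, "auto_chunk")  # collection present: deleted, then recreated
print(attr, sorted(client.collections))
```

With the real library, `evaluation` would be the `GeneralEvaluation` instance. As noted in attempt 3, pre-creating the collection this way did not make the original error go away, so this only documents what was tried.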

**Conclusion of aforementioned troubleshooting:**
The "Collection `[auto_chunk]` does not exist" error within the `chunking_evaluation` library proved difficult to resolve directly. While `chromadb` version incompatibility was a strong suspect and downgrading did change the error landscape, it didn't lead to a clean, working evaluation.
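
Since the version mismatch is the main suspect, it may help anyone reproducing this to report the installed versions alongside the error. The sketch below uses only the standard library. `installed_versions` is a helper name made up here, and the distribution names are assumptions: `chunking_evaluation` is taken from the library name used above, and `openai` is included only because the `APIStatusError` message looks like it comes from that client.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names below are assumptions; adjust them to match your environment.
report = installed_versions(["chromadb", "chunking_evaluation", "openai"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

To repeat the downgrade from attempt 5, the pin would be `pip install "chromadb==0.4.24"`, followed by a runtime restart in Colab so the new version is actually loaded.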
