[Feature] Add knowledge base metrics methods to VectorDB #23

@Ervinoreo


Is this related to a problem?
There's no easy way to query the knowledge base for statistics like document count by source, SDK version coverage, or chunk distribution. This makes it hard to monitor RAG quality or debug retrieval issues.

Describe the feature
Add diagnostic methods to the VectorDB class in src/embeddings/vectordb.py:

  • get_stats_by_source() — document and chunk count per source
  • get_sdk_version_coverage() — how many chunks exist per SDK version
  • get_chunk_size_distribution() — min/max/avg/percentile of chunk token counts
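As a rough illustration of the third method, the summary could be computed from a plain list of per-chunk token counts. This is only a sketch: the function name, the returned keys, and the nearest-rank percentile rule are assumptions, not something the VectorDB class currently defines.

```python
import math
from statistics import mean


def chunk_size_distribution(token_counts, percentiles=(50, 90, 99)):
    """Summarise chunk token counts (hypothetical helper, not existing API).

    Returns count, min, max, avg and nearest-rank percentiles.
    """
    if not token_counts:
        # Nothing indexed yet: report an empty summary instead of raising.
        return {"count": 0}

    ordered = sorted(token_counts)
    n = len(ordered)
    summary = {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "avg": mean(ordered),
    }
    for p in percentiles:
        # Nearest-rank percentile: smallest value with at least p% of
        # the data at or below it.
        rank = max(1, math.ceil(p / 100 * n))
        summary[f"p{p}"] = ordered[rank - 1]
    return summary
```

Nearest-rank is used here because it always returns a value that actually occurs in the data; an interpolating percentile would be an equally reasonable choice.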

Which module does this relate to?

  • RAG / Knowledge Base

Example usage

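# Import path assumed from the file location src/embeddings/vectordb.py
from src.embeddings.vectordb import VectorDB

db = VectorDB()

# Document and chunk counts per source
stats = db.get_stats_by_source()
# {'stylus-sdk': {'docs': 45, 'chunks': 312}, 'arbitrum-docs': {'docs': 23, 'chunks': 156}, ...}

# Number of chunks per SDK version
coverage = db.get_sdk_version_coverage()
# {'0.10.0': 245, '0.9.2': 89, '0.6.1': 12}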

Additional context

  • Files: src/embeddings/vectordb.py
  • ChromaDB collections support metadata queries — these methods would wrap collection queries
  • Useful for the M5 final report metrics and ongoing RAG quality monitoring
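One possible shape for the aggregation behind the first two methods is sketched below. It works on a list of chunk metadata dicts, which with ChromaDB could be obtained via `collection.get(include=["metadatas"])["metadatas"]`. The metadata keys `source`, `doc_id`, and `sdk_version` are assumptions about how chunks are tagged in this project, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def stats_by_source(metadatas):
    """Count distinct documents and total chunks per source.

    Assumes each chunk's metadata carries 'source' and 'doc_id' keys
    (hypothetical key names; adjust to the real schema).
    """
    doc_ids = defaultdict(set)
    chunks = Counter()
    for meta in metadatas:
        meta = meta or {}  # chunks stored without metadata may come back as None
        source = meta.get("source", "unknown")
        chunks[source] += 1
        if "doc_id" in meta:
            doc_ids[source].add(meta["doc_id"])
    return {s: {"docs": len(doc_ids[s]), "chunks": n} for s, n in chunks.items()}


def sdk_version_coverage(metadatas):
    """Count chunks per SDK version (assumes an 'sdk_version' metadata key)."""
    return dict(Counter((m or {}).get("sdk_version", "unknown") for m in metadatas))
```

Keeping the aggregation separate from the collection query would make it testable without a running ChromaDB instance; the VectorDB methods would then only fetch the metadata and delegate.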