Is this related to a problem?
There's no easy way to query the knowledge base for statistics such as document count by source, SDK version coverage, or chunk size distribution. Without these numbers it is hard to monitor RAG quality or debug retrieval issues.
Describe the feature
Add diagnostic methods to the VectorDB class in src/embeddings/vectordb.py:
- get_stats_by_source() — document and chunk count per source
- get_sdk_version_coverage() — how many chunks exist per SDK version
- get_chunk_size_distribution() — min/max/avg/percentile of chunk token counts
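As a rough sketch of what get_chunk_size_distribution() could compute, the helper below summarizes a list of token counts. The function name, the nearest-rank percentile method, and the default percentiles (50/90/99) are assumptions for illustration, not existing code in vectordb.py.

```python
import math
from typing import Dict, Sequence


def chunk_size_distribution(
    token_counts: Sequence[int],
    percentiles: Sequence[int] = (50, 90, 99),  # assumed defaults
) -> Dict[str, float]:
    """Summarize chunk token counts: min, max, mean, nearest-rank percentiles."""
    if not token_counts:
        return {}
    ordered = sorted(token_counts)
    n = len(ordered)
    summary: Dict[str, float] = {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / n,
    }
    for p in percentiles:
        # Nearest-rank: smallest value with at least p% of the data at or below it.
        rank = max(1, math.ceil(p * n / 100))
        summary[f"p{p}"] = ordered[rank - 1]
    return summary


if __name__ == "__main__":
    print(chunk_size_distribution([100, 200, 300, 400, 500]))
```

Nearest-rank is used here because it always returns an actual chunk size from the data; an interpolating percentile would work just as well if that is preferred.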
Which module does this relate to?
Example usage
db = VectorDB()
stats = db.get_stats_by_source()
# {'stylus-sdk': {'docs': 45, 'chunks': 312}, 'arbitrum-docs': {'docs': 23, 'chunks': 156}, ...}
coverage = db.get_sdk_version_coverage()
# {'0.10.0': 245, '0.9.2': 89, '0.6.1': 12}
Additional context
- Files: src/embeddings/vectordb.py
- ChromaDB collections support metadata queries — these methods would wrap collection queries
- Useful for the M5 final report metrics and ongoing RAG quality monitoring
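A minimal sketch of how the per-source and per-version counts might wrap a collection query. It relies on ChromaDB's collection.get(include=["metadatas"]); the metadata keys "source", "doc_id", and "sdk_version" are assumptions about how chunks are tagged at ingestion and would need to match whatever vectordb.py actually stores.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Set


def stats_by_source(collection: Any) -> Dict[str, Dict[str, int]]:
    """Count distinct documents and total chunks per source."""
    result = collection.get(include=["metadatas"])
    doc_ids: Dict[str, Set[str]] = defaultdict(set)
    chunks: Counter = Counter()
    for meta in result["metadatas"] or []:
        meta = meta or {}
        source = meta.get("source", "unknown")  # assumed metadata key
        chunks[source] += 1
        if "doc_id" in meta:  # assumed metadata key
            doc_ids[source].add(meta["doc_id"])
    return {s: {"docs": len(doc_ids[s]), "chunks": chunks[s]} for s in chunks}


def sdk_version_coverage(collection: Any) -> Dict[str, int]:
    """Count chunks per SDK version."""
    result = collection.get(include=["metadatas"])
    versions = Counter(
        (meta or {}).get("sdk_version", "unknown")  # assumed metadata key
        for meta in result["metadatas"] or []
    )
    return dict(versions)
```

This pulls all metadata into memory in one call, which is probably fine for a knowledge base of a few thousand chunks; for larger collections the same loop could page through results with the limit and offset arguments of get().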