Description
Self Checks
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-English title submissions will be closed directly (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
Is your feature request related to a problem?
Describe the feature you'd like
Problem Statement: Bridging the "Demo-to-Production" Gap
RAGflow currently demonstrates strong performance in proof-of-concept (PoC) scenarios. However, when deployed in production environments with diverse knowledge bases and large-scale document collections (tens of thousands of documents), the existing "single-layer retrieval" architecture—which flattens all document chunks into a single vector search space—reveals significant limitations in both accuracy and efficiency.
Key Challenges:
- Chunk Fragmentation Issues
- Context Fragmentation: Improper segmentation disrupts natural semantic units, resulting in incomplete information within individual chunks and degraded semantic representation.
- Information Dilution: Critical information ("gold nuggets") is often split across multiple chunks, making comprehensive retrieval challenging and reducing answer quality.
- Embedding Model Limitations
- Theoretical Constraints: As established in research such as "On the Theoretical Limitations of Embedding-Based Retrieval", the dimensionality of embedding vectors fundamentally limits the number of "document-query" relevance relationships that can be perfectly represented.
- Practical Bottlenecks: Commonly deployed private embedding models (e.g., qwen3-embedding-0.6B, jina-embeddings-v3 with 1024 dimensions) may lack sufficient capacity to encode complex semantic relationships at scale. While higher-dimensional models (4096/8192-dim) exist, they impose prohibitive hardware requirements and computational costs for private deployments.
- Retrieval Precision Degradation: Direct vector search across millions of chunks becomes computationally expensive and prone to vector space "crowding" and "confusion," causing relevant chunks to rank lower.
- Underutilized Metadata
- Valuable document metadata (department, author, date, document type, etc.) remains largely untapped as systematic pre-retrieval filters, wasting crucial structured information.
Proposed Solution: Three-Tier Retrieval Architecture
Inspired by search engine hierarchical principles, we propose a Knowledge Base → Document → Chunk three-tier retrieval architecture to progressively narrow the search scope and enhance both precision and efficiency.
Tier 1: Knowledge Base Routing
- Function: Automatically routes user queries to the most relevant knowledge base based on intent.
- Implementation:
- Support independent retrieval parameters per knowledge base (vector/keyword weights, recall thresholds).
- Enable dynamic routing via rule-based or LLM-based approaches to ensure domain-specific processing.
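To make the routing idea concrete, here is a minimal sketch of rule-based knowledge base routing with per-KB retrieval parameters. All names (`KnowledgeBase`, `route_query`) and the term-overlap scoring are illustrative assumptions, not RAGflow's actual API; an LLM-based router would replace the scoring function.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeBase:
    """Hypothetical per-KB config: each KB carries its own retrieval parameters."""
    name: str
    description: str
    vector_weight: float = 0.7   # weight of dense (vector) score in hybrid retrieval
    keyword_weight: float = 0.3  # weight of sparse (keyword) score
    recall_threshold: float = 0.2

def route_query(query: str, kbs: list[KnowledgeBase]) -> KnowledgeBase:
    """Rule-based routing: pick the KB whose description shares the most
    terms with the query. An LLM-based router would replace this scoring
    with an intent-classification prompt."""
    q_terms = set(query.lower().split())
    def overlap(kb: KnowledgeBase) -> int:
        return len(q_terms & set(kb.description.lower().split()))
    return max(kbs, key=overlap)
```

The point of the sketch is that the selected KB arrives with its own weights and thresholds, so downstream retrieval is already domain-specific.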
Tier 2: Document Filtering
- Function: Applies document-level metadata filtering within selected knowledge bases to identify relevant document subsets.
- Enhancements:
- Intelligent Metadata Filtering: In Auto mode, allow users to specify key metadata fields (e.g., document type, department) with LLM-generated filter conditions to avoid high-cardinality metadata interference.
- Metadata Similarity Matching: Introduce similarity operators for text-based metadata (document names, summaries) to support fuzzy matching.
- Enhanced Metadata Generation: Strengthen Data Pipeline capabilities for full-text metadata and summary generation to enrich document filtering context.
- Efficient Metadata Management: batch CRUD operations for metadata, plus a metadata management UI.
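A minimal sketch of the document-level filtering stage, assuming illustrative metadata fields (`department`, `doc_type`, `year`) and hypothetical helpers `apply_filters` / `fuzzy_filter`. In "Auto" mode the exact-match `filters` dict would be produced by an LLM from the user query, restricted to user-chosen fields to avoid high-cardinality interference; the fuzzy variant shows a similarity operator for text metadata.

```python
from difflib import SequenceMatcher
from typing import Any

# Hypothetical document records; the metadata fields are illustrative.
DOCS = [
    {"id": 1, "department": "finance", "doc_type": "report", "year": 2024},
    {"id": 2, "department": "finance", "doc_type": "memo",   "year": 2023},
    {"id": 3, "department": "legal",   "doc_type": "report", "year": 2024},
]

def apply_filters(docs: list[dict], filters: dict[str, Any]) -> list[dict]:
    """Exact-match pre-filter: keep only documents matching every condition."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]

def fuzzy_filter(docs: list[dict], field: str, needle: str,
                 threshold: float = 0.6) -> list[dict]:
    """Similarity operator for text metadata (document names, summaries):
    keep documents whose field is sufficiently similar to the needle."""
    return [d for d in docs
            if SequenceMatcher(None, str(d.get(field, "")).lower(),
                               needle.lower()).ratio() >= threshold]
```

Only the surviving subset is handed to chunk-level vector search, which is where the candidate-set reduction comes from.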
Tier 3: Chunk Refinement
- Function: Performs precise vector retrieval at the chunk level within the filtered document set.
- Enhancements:
- Parent-Child Chunking with Summary Mapping: Enable creation of parent-level summaries for contextually related chunks. Retrieval first matches macro-themes via summary vectors, then maps to original chunks for details—combining semantic robustness with granular information access.
- Customizable Prompts: Allow users to configure custom prompts for chunk keyword extraction and question generation tasks to better align with domain-specific semantics.
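The parent-child mapping above can be sketched as follows. The bag-of-words cosine stands in for real embedding similarity, and the `PARENTS` index and `retrieve` function are illustrative assumptions: retrieval first scores parent summaries, then expands the best-matching parents into their original child chunks.

```python
from collections import Counter
from math import sqrt

def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical parent-child index: each parent summary maps to its child chunks.
PARENTS = [
    {"summary": "quarterly revenue growth and sales figures",
     "chunks": ["Q3 revenue rose 12% ...", "Sales in EMEA grew ..."]},
    {"summary": "employee onboarding and training process",
     "chunks": ["New hires complete ...", "Training modules cover ..."]},
]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Match the query against parent summaries first, then return the
    original child chunks of the best-matching parents."""
    q = bow_vector(query)
    ranked = sorted(PARENTS,
                    key=lambda p: cosine(q, bow_vector(p["summary"])),
                    reverse=True)
    return [c for p in ranked[:top_k] for c in p["chunks"]]
```

Because matching happens at the summary level, a query that only touches the macro-theme still recovers all of the fine-grained chunks under it.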
Complementary Data Pipeline Enhancements
- The Data Pipeline can work as a complementary enhancement to the built-in methods, not only a replacement.
- Focus on strengthening full-text metadata generation and document-level summarization capabilities to provide robust data foundation for hierarchical retrieval.
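As a sketch of these pipeline steps, the following composes a metadata-extraction stage and a document-level summarization stage. Both step implementations are placeholders (word counts and lead sentences); in a real pipeline each would call an LLM, but the shape of the enrichment — documents leaving the pipeline with `metadata` and `summary` fields that the hierarchical retriever can filter on — is the point.

```python
def extract_metadata(doc: dict) -> dict:
    """Placeholder full-text metadata step; an LLM step would populate
    richer fields (department, doc_type, ...) in practice."""
    text = doc["text"]
    doc["metadata"] = {"word_count": len(text.split()),
                       "title_guess": text.split(".")[0][:60]}
    return doc

def summarize(doc: dict) -> dict:
    """Placeholder document-level summary: first two sentences.
    A real pipeline would call a summarization model here."""
    doc["summary"] = ". ".join(doc["text"].split(". ")[:2])
    return doc

def run_pipeline(doc: dict, steps=(extract_metadata, summarize)) -> dict:
    """Apply enrichment steps in order, mirroring a Data Pipeline run."""
    for step in steps:
        doc = step(doc)
    return doc
```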
Expected Benefits
Implementing this hierarchical retrieval architecture will enable RAGflow's critical transition from "feasible" to "production-ready":
- Improved Recall Precision: Layered filtering effectively focuses on relevant regions, reducing interference from irrelevant chunks and fundamentally addressing embedding model limitations.
- Optimized System Performance: Significantly reduces vector search candidate sets, lowering computational overhead and improving response latency.
- Enhanced System Intelligence & Flexibility: Knowledge base routing and intelligent metadata filtering enable better understanding of user intent and adaptation to complex production environments.
- Reduced Operational Costs: Template-based, batch-enabled metadata management tools minimize maintenance overhead.
Implementation Priority
High - This architecture addresses fundamental scalability and precision limitations critical for production deployments.
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
No response