Production-ready document processing pipeline for RAG systems with hybrid dense+sparse embeddings and AI-powered semantic enrichment.
This pipeline transforms raw documents into vector embeddings ready for a RAG system. It combines TF-IDF sparse vectors (keyword matching) with dense embeddings (semantic search), so a query can match on both exact terms and meaning.
The pipeline follows a multi-stage process:
Documents are split into individual sentences using NLP rules.
The C-optimized chunker groups sentences by the dot-product similarity of their embeddings rather than by token count alone. This produces semantically coherent chunks that each represent a complete thought.
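The grouping idea can be sketched in a few lines of Python. This is an illustration only, not the project's C implementation: the function names, the `threshold` cutoff, and the `max_sentences` cap are all assumptions, and toy 2-D vectors stand in for real sentence embeddings.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def group_sentences(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the project
    max_sentences: int = 8,  # hypothetical cap on chunk length
) -> List[List[str]]:
    """Start a new chunk whenever adjacent sentences stop being similar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        similar = dot(embeddings[i - 1], embeddings[i]) >= threshold
        if similar and len(chunks[-1]) < max_sentences:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return chunks


# Toy 2-D vectors stand in for real sentence embeddings.
sentences = ["Cold email works.", "Keep subject lines short.", "Qdrant stores vectors."]
embeddings = [(1.0, 0.0), (0.9, 0.436), (0.0, 1.0)]
chunks = group_sentences(sentences, embeddings)
```

Here the first two sentences score 0.9 and stay together, while the third scores 0.436 against its neighbor and opens a new chunk. With normalized embeddings the dot product equals cosine similarity, which is why a fixed threshold can be meaningful.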
- Fits a TF-IDF vectorizer on the entire corpus
- Extracts top-10 keywords per chunk
- Serializes the vectorizer to tfidf_vectorizer.joblib for consistent query processing
- Generates sparse vectors (indices + values) for hybrid search
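To make the indices + values format concrete, here is a minimal plain-Python TF-IDF sketch. It is a hand-rolled stand-in written for illustration, not the vectorizer the pipeline serializes; the function names and the whitespace-tokenized toy corpus are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def fit_vocabulary(chunks: List[List[str]]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Build a term -> index map and smoothed IDF weights from tokenized chunks."""
    doc_freq = Counter(term for tokens in chunks for term in set(tokens))
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    n = len(chunks)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return vocab, idf


def sparse_vector(tokens: List[str], vocab: Dict[str, int], idf: Dict[str, float]) -> Dict[str, list]:
    """Return L2-normalized TF-IDF weights as parallel indices/values lists.

    Tokens missing from the fitted vocabulary are ignored.
    """
    counts = Counter(t for t in tokens if t in vocab)
    weights = {vocab[t]: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] / norm for i in indices]}


def top_keywords(tokens: List[str], vocab: Dict[str, int], idf: Dict[str, float], k: int = 10) -> List[str]:
    """Rank a chunk's terms by TF-IDF weight, breaking ties alphabetically."""
    counts = Counter(t for t in tokens if t in vocab)
    ranked = sorted(counts, key=lambda t: (-counts[t] * idf[t], t))
    return ranked[:k]


corpus = [
    ["cold", "email", "outreach"],
    ["email", "subject", "line"],
    ["vector", "search", "qdrant"],
]
vocab, idf = fit_vocabulary(corpus)
vec = sparse_vector(corpus[0], vocab, idf)
keywords = top_keywords(corpus[0], vocab, idf, k=2)
query_vec = sparse_vector(["email", "unseen"], vocab, idf)
```

The query vector only contains terms the fitted vocabulary already knows, which is why the same fitted state has to be available at query time: indices produced by a different fit would point at different terms.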
Gemini 2.5 Pro analyzes each chunk to extract:
- Document metadata (type, date, author, scope)
- Semantic keywords beyond TF-IDF (e.g., "cold email" → "lead generation")
- Contribution summary: what unique value this chunk provides
This metadata is prepended to the chunk text for embedding optimization.
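A minimal sketch of the prepending step, assuming the `KEYWORDS:... CONTEXT:...` layout visible in the sample chunk record. The function name `build_enriched_text` and the single-space separators are hypothetical; the actual script may format the header differently.

```python
from typing import List


def build_enriched_text(keywords: List[str], context: str, original: str) -> str:
    """Prepend enrichment fields to the chunk text before embedding."""
    header = f"KEYWORDS:{' '.join(keywords)} CONTEXT:{context.strip()}"
    return f"{header} {original.strip()}"


enriched = build_enriched_text(
    ["lead generation", "cold email"],
    "Explains outreach strategy.",
    "Send five personalized emails per day.",
)
```

Because the header is embedded together with the original text, the dense vector reflects the extracted keywords and summary as well as the chunk's own wording.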
Voyage AI 3.5 generates 1024-dimensional vectors from the enriched text.
Hybrid Mode (dense + sparse):
python qdrant_uploader.py collection_name --sparse
Best retrieval quality. Requires the TF-IDF vectorizer joblib at query time.
Dense-Only Mode (default):
python qdrant_uploader.py collection_name
Perfect for serverless/edge deployments where you can't host the joblib file. Still provides excellent semantic search without keyword boosting.
# Clone repository
git clone https://github.com/yourusername/rag-chunking-pipeline.git
cd rag-chunking-pipeline
# Run initialization script
bash setup.sh
# Configure API keys
cp .env.example .env
Edit .env with your API keys:
GEMINI_API_KEY=your_gemini_key
VOYAGE_API_KEY=your_voyage_key
QDRANT_URL=https://your-instance.qdrant.io
QDRANT_API_KEY=your_qdrant_key
Place documents in benchwork/ directory, then:
# Full pipeline
python src/orquester.py
# Or step-by-step:
python src/sparser.py # 1. Generate sparse vectors
python src/chunk_richer.py # 2. AI enrichment
python src/final_embedder.py # 3. Dense embeddings
python src/qdrant_uploader.py my_collection --sparse # 4. Upload
Documents → Sentences (split) → Semantic Chunks (dot product) → Sparse Vectors (TF-IDF) → AI Enrichment (Gemini) → Dense Embeddings (Voyage AI) → Qdrant (hybrid)
- ✅ Semantic chunking via dot product similarity (not just token count)
- ✅ Hybrid vectors: TF-IDF sparse + Voyage AI dense (1024-dim)
- ✅ Flexible deployment: Hybrid mode or dense-only (no joblib needed)
- ✅ AI enrichment: Gemini extracts metadata and semantic keywords
- ✅ Adaptive batching: Automatic retry with smaller batches on failures
- ✅ Production-ready: Encoding detection, fallbacks, error handling
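The adaptive batching behavior can be sketched as a loop that steps down a ladder of batch sizes whenever a call fails. This is an assumption about the general approach, not the project's code: `process_adaptively` and `flaky_call` are hypothetical names, and the sketch keeps the reduced size for the rest of the run rather than trying larger batches again.

```python
from typing import Callable, List, Sequence

BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10]  # same ladder as the chunk_richer.py setting


def process_adaptively(
    items: Sequence[str],
    call: Callable[[Sequence[str]], List[str]],
    batch_sizes: Sequence[int] = BATCH_SIZES,
) -> List[str]:
    """Process items in batches, stepping down the size ladder after a failure."""
    results: List[str] = []
    start, level = 0, 0
    while start < len(items):
        batch = items[start:start + batch_sizes[level]]
        try:
            results.extend(call(batch))
            start += len(batch)
        except Exception:
            if level + 1 == len(batch_sizes):
                raise  # the smallest size also failed; surface the error
            level += 1  # retry the same position with a smaller batch
    return results


def flaky_call(batch: Sequence[str]) -> List[str]:
    """Stand-in for an API call that rejects batches above 40 items."""
    if len(batch) > 40:
        raise RuntimeError("batch too large")
    return [text.upper() for text in batch]


out = process_adaptively([f"chunk {i}" for i in range(95)], flaky_call)
```

In this run the sizes 100, 80, and 60 fail, then the 95 items go through as batches of 40, 40, and 15. No item is processed twice, because the start position only advances after a successful call.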
Each chunk contains:
{
"texto": "KEYWORDS:lead generation cold email CONTEXT:Explains outreach strategy... [original text]",
"dense_vector": [0.023, -0.041, ...], // 1024 dims
"sparse_vector": {
"indices": [45, 89, 123],
"values": [0.82, 0.71, 0.65]
},
"id": "unique_chunk_id",
"fitxer_origen": "source_document.md"
}
Tested on 200+ YouTube transcripts:
- Processing: ~50-100 chunks/minute
- Gemini batching: 100 chunks/call
- Embedding: ~900 chunks/API call
- Upload: ~200 chunks/second
rag-chunking-pipeline/
├── src/
│ ├── chunk_builder.c # Semantic chunker (C)
│ ├── sparser.py # TF-IDF sparse vectors
│ ├── chunk_richer.py # Gemini enrichment
│ ├── final_embedder.py # Dense embeddings
│ ├── qdrant_uploader.py # Upload to Qdrant
│ └── orquester.py # Pipeline orchestrator
├── setup.sh # Initialization script
├── requirements.txt
└── README.md
- 📚 Documentation search (technical docs, wikis)
- ⚖️ Legal/compliance (regulations, contracts)
- 🔬 Research papers (academic literature)
- 💬 Customer support (knowledge bases)
- 🎓 Educational content (courses, tutorials)
# sparser.py
custom_stopwords = set([...]) # Customize stopwords
max_df=0.8 # Max document frequency
min_df=2 # Min document frequency
# chunk_richer.py
BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10] # Adaptive batching
model = "gemini-2.5-pro"
# final_embedder.py
BATCH_SIZE = 900
model = "voyage-3.5"
output_dimension = 1024
| Feature | Hybrid Mode | Dense-Only Mode |
|---|---|---|
| Retrieval Quality | ⭐⭐⭐⭐⭐ Best | ⭐⭐⭐⭐ Excellent |
| Keyword Matching | ✅ TF-IDF boosting | ❌ No exact matching |
| Query Processing | Requires joblib | Zero preprocessing |
| Deployment | Server/VM | Serverless/Edge |
| Use Case | Maximum precision | Simplified deployment |
MIT License - see LICENSE for details.
Built for processing AI agency content. Powered by Gemini 2.5 Pro, Voyage AI 3.5, and Qdrant.
Built with ❤️ for RAG enthusiasts