aiLinkCodes/semantic-rag-chunker

RAG Chunking Pipeline

Production-ready document processing pipeline for RAG systems with hybrid dense+sparse embeddings and AI-powered semantic enrichment.

Python · Qdrant · Gemini · Voyage AI

Overview

This pipeline transforms raw documents into vector embeddings ready to upload to Qdrant for RAG systems. It combines TF-IDF sparse vectors (keyword matching) with dense embeddings (semantic search), so a query can match on both exact terms and meaning.

How It Works

The pipeline follows a multi-stage process:

1. Sentence Splitting

Documents are split into individual sentences using NLP rules.
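
The splitting rules themselves are not shown in this README. As a minimal sketch, assuming a simple punctuation-based rule (the function name `split_sentences` and the regex are illustrative, not the pipeline's actual code), sentence splitting could look like this:

```python
import re

# Assumed rule: a sentence ends at ., ! or ? when followed by whitespace
# and an uppercase letter or digit. Abbreviations such as "e.g." are not
# handled by this sketch.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")

def split_sentences(text: str) -> list:
    """Split text into sentences using a simple punctuation rule."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("Cold email works. Does it scale? Yes, with care!")
print(sentences)
```

A rule this simple will mis-split on abbreviations and decimals, so treat it as a starting point only.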

2. Semantic Chunking (Dot Product Similarity)

The C-optimized chunker groups sentences by the dot product similarity between their embeddings, not just by token count. This produces semantically coherent chunks that each represent a complete thought.
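
The chunker itself is written in C and its exact grouping rule is not documented here. The idea can be sketched in Python under two assumptions: only adjacent sentences are compared, and a new chunk starts when their dot product falls below a fixed threshold. The name `chunk_by_similarity` and the `threshold` value are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def chunk_by_similarity(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences into chunks.

    A new chunk starts whenever the dot product between a sentence's
    embedding and the previous sentence's embedding drops below
    `threshold` (an assumed rule, not the C implementation's).
    """
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if dot(embeddings[i - 1], embeddings[i]) >= threshold:
            chunks[-1].append(sentences[i])   # same topic: extend chunk
        else:
            chunks.append([sentences[i]])     # topic shift: new chunk
    return [" ".join(chunk) for chunk in chunks]

# Toy 2-D embeddings: the first two sentences point in similar directions.
sents = ["Cold email works.", "It drives leads.", "Pricing is separate."]
embs = [[1.0, 0.0], [0.9, 0.436], [0.0, 1.0]]
print(chunk_by_similarity(sents, embs))
```

With normalized embeddings the dot product equals cosine similarity, which is why a single threshold can work across sentences of different lengths.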

3. TF-IDF Sparse Vectors + Joblib

  • Fits a TF-IDF vectorizer on the entire corpus
  • Extracts top-10 keywords per chunk
  • Serializes the vectorizer to tfidf_vectorizer.joblib for consistent query processing
  • Generates sparse vectors (indices + values) for hybrid search
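
To show what these steps produce, here is a standard-library sketch. It is not the pipeline's code: `sparser.py` uses a vectorizer persisted with joblib, whereas here the hypothetical `vocab` and `idf` dictionaries stand in for the fitted state that has to be saved and reused at query time. Stopword and `max_df`/`min_df` filtering are omitted.

```python
import math
from collections import Counter

def fit_idf(chunks):
    """Learn term indices and smoothed IDF weights from the whole corpus."""
    docs = [set(c.lower().split()) for c in chunks]
    vocab = {t: i for i, t in enumerate(sorted(set().union(*docs)))}
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
           for t in vocab}
    return vocab, idf

def to_sparse(text, vocab, idf):
    """Return (indices, values): L2-normalized TF-IDF weights of known terms."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    weights = {vocab[t]: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    indices = sorted(weights)
    return indices, [weights[i] / norm for i in indices]

def top_keywords(text, vocab, idf, k=10):
    """Return the k highest-weighted terms of a chunk."""
    indices, values = to_sparse(text, vocab, idf)
    terms = {i: t for t, i in vocab.items()}
    ranked = sorted(zip(values, indices), reverse=True)[:k]
    return [terms[i] for _, i in ranked]

corpus = ["cold email outreach", "email marketing basics", "pricing strategy basics"]
vocab, idf = fit_idf(corpus)
indices, values = to_sparse("cold email", vocab, idf)
print(indices, values)
print(top_keywords("cold email outreach", vocab, idf, k=2))
```

The same `vocab` and `idf` must be used for documents and queries; otherwise the indices in the two sparse vectors refer to different terms. That is why the fitted vectorizer is serialized.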

4. AI Enrichment with Gemini

Gemini 2.5 Pro analyzes each chunk to extract:

  • Document metadata (type, date, author, scope)
  • Semantic keywords beyond TF-IDF (e.g., "cold email" → "lead generation")
  • Contribution summary: what unique value this chunk provides

This metadata is prepended to the chunk text before embedding, so the dense vector reflects the enrichment as well as the original text.
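
The sample output further down this README shows the enriched text in the form `KEYWORDS:... CONTEXT:... [original text]`. A minimal sketch of that prepending step, assuming that layout (the function name `build_enriched_text` is hypothetical):

```python
def build_enriched_text(keywords, context, original_text):
    """Prepend enrichment metadata to a chunk's text.

    The KEYWORDS:/CONTEXT: layout follows the sample output in this
    README; the exact separators used by chunk_richer.py are an assumption.
    """
    return f"KEYWORDS:{' '.join(keywords)} CONTEXT:{context} {original_text}"

enriched = build_enriched_text(
    ["lead generation", "cold email"],
    "Explains outreach strategy.",
    "We send 50 emails a day.",
)
print(enriched)
```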

5. Dense Embeddings

Voyage AI's voyage-3.5 model generates 1024-dimensional vectors from the enriched text.

6. Qdrant Upload (Flexible Modes)

Hybrid Mode (dense + sparse):

python src/qdrant_uploader.py collection_name --sparse

Best retrieval quality. Requires the serialized TF-IDF vectorizer (tfidf_vectorizer.joblib) at query time.

Dense-Only Mode (default):

python src/qdrant_uploader.py collection_name

Suited to serverless/edge deployments where you can't host the joblib file. It still provides semantic search, but without TF-IDF keyword boosting.
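
The difference between the two modes comes down to which vectors are attached to each point. The sketch below builds a plain dictionary per chunk and does not call the Qdrant client. The vector names `dense` and `sparse`, the payload layout, and the function `build_point` are all assumptions, since the uploader's code is not shown here.

```python
def build_point(chunk, use_sparse=False):
    """Build an upload record for one chunk.

    use_sparse=True mirrors the --sparse flag (hybrid mode);
    use_sparse=False mirrors the default dense-only mode.
    """
    vectors = {"dense": chunk["dense_vector"]}
    if use_sparse:
        sparse = chunk["sparse_vector"]
        vectors["sparse"] = {
            "indices": sparse["indices"],
            "values": sparse["values"],
        }
    return {
        "id": chunk["id"],
        "vector": vectors,
        "payload": {
            "texto": chunk["texto"],
            "fitxer_origen": chunk["fitxer_origen"],
        },
    }

chunk = {
    "texto": "KEYWORDS:cold email CONTEXT:Outreach. Original text",
    "dense_vector": [0.023, -0.041],
    "sparse_vector": {"indices": [45, 89], "values": [0.82, 0.71]},
    "id": "unique_chunk_id",
    "fitxer_origen": "source_document.md",
}
print(sorted(build_point(chunk, use_sparse=True)["vector"]))
print(sorted(build_point(chunk)["vector"]))
```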

Quick Start

1. Installation

# Clone repository
git clone https://github.com/aiLinkCodes/semantic-rag-chunker.git
cd semantic-rag-chunker

# Run initialization script
bash setup.sh

# Configure API keys
cp .env.example .env
# Edit .env with your API keys

2. Configure Environment

Edit .env with your API keys:

GEMINI_API_KEY=your_gemini_key
VOYAGE_API_KEY=your_voyage_key
QDRANT_URL=https://your-instance.qdrant.io
QDRANT_API_KEY=your_qdrant_key

3. Run the Pipeline

Place your documents in the benchwork/ directory, then run:

# Full pipeline
python src/orquester.py

# Or step-by-step:
python src/sparser.py           # 1. Generate sparse vectors
python src/chunk_richer.py      # 2. AI enrichment
python src/final_embedder.py    # 3. Dense embeddings
python src/qdrant_uploader.py my_collection --sparse  # 4. Upload
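
How `orquester.py` chains these steps is not documented here. As a rough sketch, assuming it simply runs the four scripts in order and stops at the first failure, an equivalent driver could look like this (`build_commands` and `run_pipeline` are hypothetical names):

```python
import subprocess
import sys

def build_commands(collection, use_sparse=True, python=sys.executable):
    """Return the step-by-step commands listed above, in order."""
    upload = [python, "src/qdrant_uploader.py", collection]
    if use_sparse:
        upload.append("--sparse")   # hybrid mode; omit for dense-only
    return [
        [python, "src/sparser.py"],         # 1. sparse vectors
        [python, "src/chunk_richer.py"],    # 2. AI enrichment
        [python, "src/final_embedder.py"],  # 3. dense embeddings
        upload,                             # 4. upload
    ]

def run_pipeline(collection, use_sparse=True):
    """Run each step; check=True raises if a step exits non-zero."""
    for cmd in build_commands(collection, use_sparse):
        subprocess.run(cmd, check=True)

for cmd in build_commands("my_collection", python="python"):
    print(" ".join(cmd))
```

Calling `run_pipeline("my_collection")` from the repository root would then execute the steps for real.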

Pipeline Architecture

Documents → Sentences → Semantic Chunks → Sparse Vectors → AI Enrichment → Dense Embeddings → Qdrant
            (split)      (dot product)     (TF-IDF)         (Gemini)        (Voyage AI)      (hybrid)

Key Features

  • Semantic chunking via dot product similarity (not just token count)
  • Hybrid vectors: TF-IDF sparse + Voyage AI dense (1024-dim)
  • Flexible deployment: Hybrid mode or dense-only (no joblib needed)
  • AI enrichment: Gemini extracts metadata and semantic keywords
  • Adaptive batching: Automatic retry with smaller batches on failures
  • Production-ready: Encoding detection, fallbacks, error handling

Output Structure

Each chunk contains:

{
  "texto": "KEYWORDS:lead generation cold email CONTEXT:Explains outreach strategy... [original text]",
  "dense_vector": [0.023, -0.041, ...],  // 1024 dims
  "sparse_vector": {
    "indices": [45, 89, 123],
    "values": [0.82, 0.71, 0.65]
  },
  "id": "unique_chunk_id",
  "fitxer_origen": "source_document.md"
}
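
Before uploading, it can help to check that each record matches this shape. The validator below is an illustrative sketch, not part of the pipeline: it checks the fields shown above, the dense vector length, and that sparse `indices` and `values` line up.

```python
def validate_chunk(chunk, dense_dim=1024):
    """Return a list of problems found in one chunk record (empty if OK)."""
    errors = []
    for key in ("texto", "dense_vector", "id", "fitxer_origen"):
        if key not in chunk:
            errors.append(f"missing field: {key}")
    dense = chunk.get("dense_vector")
    if dense is not None and len(dense) != dense_dim:
        errors.append(f"dense_vector has {len(dense)} dims, expected {dense_dim}")
    sparse = chunk.get("sparse_vector")  # optional in dense-only mode
    if sparse is not None:
        if len(sparse.get("indices", [])) != len(sparse.get("values", [])):
            errors.append("sparse_vector indices/values length mismatch")
    return errors

good = {
    "texto": "KEYWORDS:cold email CONTEXT:Outreach. Original text",
    "dense_vector": [0.0] * 1024,
    "sparse_vector": {"indices": [45, 89, 123], "values": [0.82, 0.71, 0.65]},
    "id": "unique_chunk_id",
    "fitxer_origen": "source_document.md",
}
bad = dict(good, sparse_vector={"indices": [45, 89], "values": [0.82]})
print(validate_chunk(good))
print(validate_chunk(bad))
```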

Performance

Tested on 200+ YouTube transcripts:

  • Processing: ~50-100 chunks/minute
  • Gemini batching: 100 chunks/call
  • Embedding: ~900 chunks/API call
  • Upload: ~200 chunks/second

Project Structure

rag-chunking-pipeline/
├── src/
│   ├── chunk_builder.c         # Semantic chunker (C)
│   ├── sparser.py              # TF-IDF sparse vectors
│   ├── chunk_richer.py         # Gemini enrichment
│   ├── final_embedder.py       # Dense embeddings
│   ├── qdrant_uploader.py      # Upload to Qdrant
│   └── orquester.py            # Pipeline orchestrator
├── setup.sh                    # Initialization script
├── requirements.txt
└── README.md

Use Cases

  • 📚 Documentation search (technical docs, wikis)
  • ⚖️ Legal/compliance (regulations, contracts)
  • 🔬 Research papers (academic literature)
  • 💬 Customer support (knowledge bases)
  • 🎓 Educational content (courses, tutorials)

Advanced Configuration

Sparse Vector Generation

# sparser.py
custom_stopwords = set([...])  # Customize stopwords
max_df=0.8  # Max document frequency: drop terms found in more than 80% of documents
min_df=2    # Min document frequency: drop terms found in fewer than 2 documents

AI Enrichment

# chunk_richer.py
BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10]  # Adaptive batching
model = "gemini-2.5-pro"
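
The `BATCH_SIZES` list suggests a step-down retry: try the largest batch first, and fall back to the next smaller size when a call fails. A minimal sketch of that idea follows. The function `process_with_fallback` and the injected `call_model` callable are hypothetical, and staying at the smaller size after a success is an assumption about the real behavior.

```python
BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10]

def process_with_fallback(chunks, call_model, batch_sizes=BATCH_SIZES):
    """Process chunks in batches, shrinking the batch size on failure."""
    results = []
    start = 0
    level = 0  # index into batch_sizes
    while start < len(chunks):
        batch = chunks[start:start + batch_sizes[level]]
        try:
            results.extend(call_model(batch))
        except Exception:
            if level + 1 >= len(batch_sizes):
                raise          # smallest size also failed: give up
            level += 1         # retry the same position with a smaller batch
            continue
        start += len(batch)
    return results

# Stand-in for the Gemini call: rejects batches larger than 60.
calls = []
def fake_model(batch):
    calls.append(len(batch))
    if len(batch) > 60:
        raise RuntimeError("batch too large")
    return [f"enriched:{c}" for c in batch]

out = process_with_fallback([f"c{i}" for i in range(150)], fake_model)
print(len(out), calls)  # 150 [100, 80, 60, 60, 30]
```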

Dense Embeddings

# final_embedder.py
BATCH_SIZE = 900
model = "voyage-3.5"
output_dimension = 1024
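
With `BATCH_SIZE = 900`, the enriched texts are sent to the embedding API in slices of up to 900. The sketch below shows only the batching logic. `embed_batch` is a hypothetical callable standing in for the Voyage AI request, which is not reproduced here.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_all(texts, embed_batch, batch_size=900):
    """Embed all texts, one API call per batch, preserving order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder: records batch sizes and returns dummy 1024-dim vectors.
seen = []
def fake_embed(batch):
    seen.append(len(batch))
    return [[0.0] * 1024 for _ in batch]

vecs = embed_all([f"text {i}" for i in range(2000)], fake_embed)
print(len(vecs), seen)  # 2000 [900, 900, 200]
```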

Deployment Modes Comparison

| Feature | Hybrid Mode | Dense-Only Mode |
|---|---|---|
| Retrieval Quality | ⭐⭐⭐⭐⭐ Best | ⭐⭐⭐⭐ Excellent |
| Keyword Matching | ✅ TF-IDF boosting | ❌ No exact matching |
| Query Processing | Requires joblib | Zero preprocessing |
| Deployment | Server/VM | Serverless/Edge |
| Use Case | Maximum precision | Simplified deployment |

License

MIT License - see LICENSE for details.

Acknowledgments

Built for processing AI agency content. Powered by Gemini 2.5 Pro, Voyage AI 3.5, and Qdrant.


Built with ❤️ for RAG enthusiasts
