RAG Chunking Pipeline

Production-ready document processing pipeline for RAG systems with hybrid dense+sparse embeddings and AI-powered semantic enrichment.

Overview

This pipeline transforms raw documents into vector embeddings ready for a RAG system. It combines TF-IDF sparse vectors (keyword matching) with dense embeddings (semantic search), so retrieval can match on exact terms as well as on meaning.

How It Works

The pipeline follows a multi-stage process:

1. Sentence Splitting

Documents are split into individual sentences using NLP rules.
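
The README does not spell out which NLP rules are used, so the following is only a minimal sketch of rule-based splitting. The boundary rule (end punctuation followed by whitespace and an uppercase letter, digit, or quote) and the name `split_sentences` are assumptions for illustration, not the pipeline's actual splitter.

```python
import re

# Assumed rule: a sentence ends at '.', '!' or '?' when it is followed by
# whitespace and then an uppercase letter, a digit, or an opening quote.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])')


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using a single boundary regex."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```

A rule this simple will mis-split abbreviations such as "e.g. Example"; a real splitter needs an exception list or a trained model.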

2. Semantic Chunking (Dot Product Similarity)

The chunker, written in C, groups sentences by the dot product similarity of their embeddings rather than by token count alone. This produces semantically coherent chunks, each covering a complete thought.
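
To make the idea concrete, here is a rough Python sketch of similarity-based grouping. It is not a port of chunk_builder.c: the greedy adjacent-sentence comparison, the `threshold` value, and the `max_sentences` cap are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def chunk_by_similarity(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.5,   # hypothetical cut-off
    max_sentences: int = 8,   # hypothetical size cap
) -> List[List[str]]:
    """Start a new chunk when similarity to the previous sentence drops
    below the threshold, or when the current chunk is full."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    chunks: List[List[str]] = []
    current: List[str] = []
    for i, sentence in enumerate(sentences):
        if current:
            similarity = dot(embeddings[i - 1], embeddings[i])
            if similarity < threshold or len(current) >= max_sentences:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks
```

With unit-normalized embeddings the dot product equals cosine similarity, which is what makes a fixed threshold meaningful.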

3. TF-IDF Sparse Vectors + Joblib

Fits a TF-IDF vectorizer on the entire corpus
Extracts top-10 keywords per chunk
Serializes the vectorizer to tfidf_vectorizer.joblib for consistent query processing
Generates sparse vectors (indices + values) for hybrid search
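
The steps above can be sketched in plain Python. This is a hand-rolled stand-in, not the code in sparser.py: the function names are hypothetical, tokenization is assumed to have happened already, serialization to joblib is omitted, and the smoothed IDF formula is assumed to match scikit-learn's default.

```python
import math
from collections import Counter


def fit_idf(corpus_tokens):
    """Fit a vocabulary and smoothed IDF weights on the whole corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    vocabulary = {term: index for index, term in enumerate(sorted(doc_freq))}
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return vocabulary, idf


def sparse_vector(tokens, vocabulary, idf, top_k=10):
    """Return L2-normalized TF-IDF indices/values plus the top-k keywords."""
    counts = Counter(token for token in tokens if token in vocabulary)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    pairs = sorted((vocabulary[term], weight / norm) for term, weight in weights.items())
    return {
        "indices": [index for index, _ in pairs],
        "values": [value for _, value in pairs],
        "keywords": ranked[:top_k],
    }
```

The same `vocabulary` and `idf` must be reused at query time, which is why the fitted vectorizer has to be persisted alongside the collection.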

4. AI Enrichment with Gemini

Gemini 2.5 Pro analyzes each chunk to extract:

Document metadata (type, date, author, scope)
Semantic keywords beyond TF-IDF (e.g., "cold email" → "lead generation")
Contribution summary: what unique value this chunk provides

This metadata is prepended to the chunk text before embedding, so the dense vector also encodes the extracted keywords and context.
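
A small sketch of the prepending step, using the `KEYWORDS:... CONTEXT:...` layout visible in the "texto" field under Output Structure. The function name and its parameters are hypothetical; the exact separator and field handling in chunk_richer.py may differ.

```python
def build_enriched_text(chunk_text: str, keywords: list[str], context: str) -> str:
    """Prepend keywords and a context summary to the original chunk text."""
    keyword_part = " ".join(k.strip() for k in keywords if k.strip())
    return f"KEYWORDS:{keyword_part} CONTEXT:{context.strip()} {chunk_text.strip()}"
```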

5. Dense Embeddings

Voyage AI's voyage-3.5 model generates 1024-dimensional vectors from the enriched text.

6. Qdrant Upload (Flexible Modes)

Hybrid Mode (dense + sparse):

python src/qdrant_uploader.py collection_name --sparse

Best retrieval quality. Requires the serialized TF-IDF vectorizer (tfidf_vectorizer.joblib) at query time to build the query's sparse vector.

Dense-Only Mode (default):

python src/qdrant_uploader.py collection_name

Suited to serverless/edge deployments where you can't host the joblib file. Semantic search still works, but without TF-IDF keyword boosting.
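
One way the two modes could be selected from the command line is sketched below. Only the positional collection name and the `--sparse` flag come from the commands above; `parse_args`, `select_vectors`, and the `"dense"`/`"sparse"` vector names are hypothetical and not taken from qdrant_uploader.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the uploader's command line: a collection name plus --sparse."""
    parser = argparse.ArgumentParser(description="Upload chunks to Qdrant (sketch).")
    parser.add_argument("collection_name")
    parser.add_argument(
        "--sparse",
        action="store_true",
        help="also upload TF-IDF sparse vectors (hybrid mode)",
    )
    return parser.parse_args(argv)


def select_vectors(chunk: dict, use_sparse: bool) -> dict:
    """Pick which vectors of a chunk to upload for the chosen mode."""
    vectors = {"dense": chunk["dense_vector"]}  # vector names are assumptions
    if use_sparse:
        vectors["sparse"] = chunk["sparse_vector"]
    return vectors
```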

Quick Start

1. Installation

# Clone repository
git clone https://github.com/yourusername/rag-chunking-pipeline.git
cd rag-chunking-pipeline

# Run initialization script
bash setup.sh

# Configure API keys
cp .env.example .env
# Edit .env with your API keys

2. Configure Environment

Edit .env with your API keys:

GEMINI_API_KEY=your_gemini_key
VOYAGE_API_KEY=your_voyage_key
QDRANT_URL=https://your-instance.qdrant.io
QDRANT_API_KEY=your_qdrant_key
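
Before running the pipeline it can help to confirm that all four settings are present. The check below is a sketch: it assumes the values from .env have already been loaded into the process environment (how the project does that is not stated here), and `missing_settings` is a hypothetical helper.

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY", "VOYAGE_API_KEY", "QDRANT_URL", "QDRANT_API_KEY")


def missing_settings(env=None) -> list[str]:
    """Return the required settings that are absent or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_settings()` with no argument checks the real environment; passing a dict makes the function easy to test.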

3. Run the Pipeline

Place your documents in the benchwork/ directory, then run:

# Full pipeline
python src/orquester.py

# Or step-by-step:
python src/sparser.py           # 1. Generate sparse vectors
python src/chunk_richer.py      # 2. AI enrichment
python src/final_embedder.py    # 3. Dense embeddings
python src/qdrant_uploader.py my_collection --sparse  # 4. Upload
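
The step-by-step commands can also be chained from a short script. This is a sketch of one possible orchestration, not the contents of orquester.py; `build_commands` and `run_pipeline` are hypothetical names, and stopping at the first failing step is an assumption.

```python
import subprocess
import sys

# Script order taken from the step-by-step commands above.
STEPS = ["src/sparser.py", "src/chunk_richer.py", "src/final_embedder.py"]


def build_commands(collection: str, sparse: bool = True, python: str = sys.executable):
    """Build the argument list for each pipeline step, upload last."""
    commands = [[python, script] for script in STEPS]
    upload = [python, "src/qdrant_uploader.py", collection]
    if sparse:
        upload.append("--sparse")
    commands.append(upload)
    return commands


def run_pipeline(collection: str, sparse: bool = True) -> None:
    """Run every step in order; check=True aborts on the first failure."""
    for command in build_commands(collection, sparse):
        subprocess.run(command, check=True)
```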

Pipeline Architecture

Documents → Sentences → Semantic Chunks → Sparse Vectors → AI Enrichment → Dense Embeddings → Qdrant
            (split)      (dot product)     (TF-IDF)         (Gemini)        (Voyage AI)      (hybrid)

Key Features

✅ Semantic chunking via dot product similarity (not just token count)
✅ Hybrid vectors: TF-IDF sparse + Voyage AI dense (1024-dim)
✅ Flexible deployment: Hybrid mode or dense-only (no joblib needed)
✅ AI enrichment: Gemini extracts metadata and semantic keywords
✅ Adaptive batching: Automatic retry with smaller batches on failures
✅ Production-ready: Encoding detection, fallbacks, error handling

Output Structure

Each chunk contains:

{
  "texto": "KEYWORDS:lead generation cold email CONTEXT:Explains outreach strategy... [original text]",
  "dense_vector": [0.023, -0.041, ...],  // 1024 dims
  "sparse_vector": {
    "indices": [45, 89, 123],
    "values": [0.82, 0.71, 0.65]
  },
  "id": "unique_chunk_id",
  "fitxer_origen": "source_document.md"
}
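
A quick sanity check for records of this shape might look like the sketch below. The field names come from the example above; the rules themselves (a 1024-dim dense vector, equal-length and duplicate-free sparse indices, and `sparse_vector` being optional in dense-only mode) are assumptions, and `validate_chunk` is a hypothetical helper.

```python
REQUIRED_FIELDS = ("texto", "dense_vector", "id", "fitxer_origen")


def validate_chunk(chunk: dict, dense_dim: int = 1024) -> list[str]:
    """Return a list of problems found in one chunk record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in chunk]
    dense = chunk.get("dense_vector")
    if dense is not None and len(dense) != dense_dim:
        problems.append(f"dense_vector has {len(dense)} dims, expected {dense_dim}")
    sparse = chunk.get("sparse_vector")  # assumed absent in dense-only mode
    if sparse is not None:
        indices = sparse.get("indices", [])
        values = sparse.get("values", [])
        if len(indices) != len(values):
            problems.append("sparse_vector indices and values differ in length")
        if len(set(indices)) != len(indices):
            problems.append("sparse_vector has duplicate indices")
    return problems
```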

Performance

Tested on 200+ YouTube transcripts:

Processing: ~50-100 chunks/minute
Gemini batching: 100 chunks/call
Embedding: ~900 chunks/API call
Upload: ~200 chunks/second

Project Structure

rag-chunking-pipeline/
├── src/
│   ├── chunk_builder.c         # Semantic chunker (C)
│   ├── sparser.py              # TF-IDF sparse vectors
│   ├── chunk_richer.py         # Gemini enrichment
│   ├── final_embedder.py       # Dense embeddings
│   ├── qdrant_uploader.py      # Upload to Qdrant
│   └── orquester.py            # Pipeline orchestrator
├── setup.sh                    # Initialization script
├── requirements.txt
└── README.md

Use Cases

📚 Documentation search (technical docs, wikis)
⚖️ Legal/compliance (regulations, contracts)
🔬 Research papers (academic literature)
💬 Customer support (knowledge bases)
🎓 Educational content (courses, tutorials)

Advanced Configuration

Sparse Vector Generation

# sparser.py
custom_stopwords = set([...])  # Customize stopwords
max_df=0.8  # Max document frequency
min_df=2    # Min document frequency
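
To show what these two thresholds do, here is a plain-Python sketch of document-frequency filtering. It assumes the settings follow scikit-learn's `TfidfVectorizer` conventions, where a float `max_df` is a proportion of documents and an integer `min_df` is an absolute count; `filter_vocabulary` is a hypothetical helper, not part of sparser.py.

```python
from collections import Counter


def filter_vocabulary(corpus_tokens, max_df=0.8, min_df=2):
    """Keep terms that appear in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
```

Terms that appear almost everywhere carry little signal, and terms that appear once are often noise, so both ends are trimmed.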

AI Enrichment

# chunk_richer.py
BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10]  # Adaptive batching
model = "gemini-2.5-pro"
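
The descending `BATCH_SIZES` list suggests a retry loop along the following lines. This is a sketch under assumptions: `process_batch` stands in for whatever call sends a batch to Gemini, any exception is treated as a reason to retry with the next smaller size, and each new batch starts again from the largest size.

```python
BATCH_SIZES = [100, 80, 60, 40, 30, 20, 10]


def process_with_adaptive_batches(items, process_batch, batch_sizes=BATCH_SIZES):
    """Process items in batches, shrinking the batch size after a failure."""
    results = []
    position = 0
    while position < len(items):
        last_error = None
        for size in batch_sizes:
            batch = items[position:position + size]
            try:
                results.extend(process_batch(batch))
            except Exception as error:  # assumption: every failure is retryable
                last_error = error
                continue
            position += len(batch)
            break
        else:
            # No batch size succeeded for the items at this offset.
            raise RuntimeError(f"batch at offset {position} failed at every size") from last_error
    return results
```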

Dense Embeddings

# final_embedder.py
BATCH_SIZE = 900
model = "voyage-3.5"
output_dimension = 1024
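
With `BATCH_SIZE = 900`, the embedder presumably slices the enriched texts into groups of at most 900 per API call. A minimal sketch of that slicing follows; `batched` and `embed_all` are hypothetical names, and `embed_batch` is a placeholder for the real Voyage AI call, which is not shown here.

```python
def batched(items, batch_size=900):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=900):
    """Embed all texts, calling embed_batch once per slice."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```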

Deployment Modes Comparison

Feature	Hybrid Mode	Dense-Only Mode
Retrieval Quality	⭐⭐⭐⭐⭐ Best	⭐⭐⭐⭐ Excellent
Keyword Matching	✅ TF-IDF boosting	❌ No exact matching
Query Processing	Requires joblib	Zero preprocessing
Deployment	Server/VM	Serverless/Edge
Use Case	Maximum precision	Simplified deployment

License

MIT License - see LICENSE for details.

Acknowledgments

Built for processing AI agency content. Powered by Gemini 2.5 Pro, Voyage AI 3.5, and Qdrant.

Built with ❤️ for RAG enthusiasts
