🚀 A Node.js-based RAG (Retrieval-Augmented Generation) system that converts PDFs into searchable knowledge bases. Built with LangChain and Google's Generative AI, optimized for technical content like DSA documentation.
- PDF document loading and processing
- Text chunking with customizable size and overlap
- Embedding generation using Google's Generative AI
- Vector storage and retrieval using Pinecone
- Environment-based configuration
- Node.js v20 or later
- A Google AI (Gemini) API key
- A Pinecone account and API key
- A PDF document to process: this implementation uses a Data Structures and Algorithms (DSA) PDF, with the configuration tuned for technical content.
- Clone the repository:

```bash
git clone [your-repo-url]
cd RAG
```

- Install dependencies:

```bash
npm install
```

- Create a `.env` file in the root directory with the following variables:

```
GEMINI_API_KEY=your_gemini_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_ENVIRONMENT=your_pinecone_env
PINECONE_INDEX_NAME=your_pinecone_index_name
```

Project files:

- `index.js` – Main application file handling PDF processing and vector storage
- `query.js` – (To be implemented) Query interface for the RAG system
- `dsa.pdf` – Sample PDF document for processing
The project uses the following configuration parameters:
- Chunk Size: 1000 characters
- Chunk Overlap: 200 characters
- Text Embedding Model: 'text-embedding-004'
- Max Concurrency for Pinecone uploads: 5
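To illustrate how the chunk size and overlap parameters interact, here is a plain-JS sketch of sliding-window chunking. This is a simplification: the actual `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and sentences rather than at fixed character offsets.

```javascript
// Plain-JS illustration of fixed-size chunking with overlap.
// With chunkSize = 1000 and overlap = 200, each new chunk starts
// 800 characters after the previous one, so neighboring chunks
// share 200 characters of context.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // 800 with the defaults above
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const chunks = chunkText("x".repeat(2500));
console.log(chunks.length);    // 3
console.log(chunks[0].length); // 1000
console.log(chunks[2].length); // 900 (final, shorter chunk)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.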
- `@langchain/google-genai`: For generating embeddings
- `@langchain/textsplitters`: For splitting text into chunks
- `@langchain/community`: For PDF loading capabilities
- `@langchain/pinecone`: For vector store operations
- `@pinecone-database/pinecone`: For Pinecone database interactions
- `dotenv`: For environment variable management
- `pdf-parse`: For PDF parsing
1. PDF Loading: The system uses LangChain's `PDFLoader` to load and parse PDF documents.
2. Text Chunking: Documents are split into smaller chunks using `RecursiveCharacterTextSplitter` with configurable chunk size and overlap.
3. Embedding Generation: Each chunk is converted into an embedding using Google's Generative AI model.
4. Vector Storage: The embeddings are stored in a Pinecone vector database for efficient similarity search.
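The four steps above can be sketched roughly as follows using the packages listed under dependencies. This is an assumption-laden sketch, not the exact contents of `index.js`: the `PDF_PATH` value, client setup, and error handling in the real file may differ.

```javascript
// Sketch of the indexing pipeline (assumed wiring; see index.js for
// the authoritative version). Requires GEMINI_API_KEY, PINECONE_API_KEY,
// and PINECONE_INDEX_NAME in the environment.
import "dotenv/config";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";

const PDF_PATH = "./dsa.pdf";

// 1. PDF Loading
const docs = await new PDFLoader(PDF_PATH).load();

// 2. Text Chunking: 1000-char chunks with 200-char overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Embedding Generation with Google's text-embedding-004 model
const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.GEMINI_API_KEY,
  model: "text-embedding-004",
});

// 4. Vector Storage: embed each chunk and upsert into Pinecone,
// with at most 5 concurrent upload batches
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
const pineconeIndex = pinecone.index(process.env.PINECONE_INDEX_NAME);
await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex,
  maxConcurrency: 5,
});
```

`PineconeStore.fromDocuments` handles both the embedding calls and the upserts, which is why steps 3 and 4 collapse into a single call at the end.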
1. Place your PDF document in the project directory and update `PDF_PATH` in `index.js` if needed.
2. Run the indexing process:

```bash
node index.js
```

3. (To be implemented) Run queries against your indexed document:

```bash
node query.js "your question here"
```

The system includes basic error handling for:
- PDF loading failures
- Text processing errors
- API connection issues
- Missing environment variables
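A common way to handle the last case is to validate required variables at startup, before any API clients are constructed. The helper below is a hypothetical sketch (not taken from `index.js`):

```javascript
// Fail fast with a clear message if required configuration is missing,
// rather than letting an API client fail later with a cryptic error.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}

// Example: passes when both variables are present.
requireEnv(["GEMINI_API_KEY", "PINECONE_INDEX_NAME"], {
  GEMINI_API_KEY: "demo",
  PINECONE_INDEX_NAME: "demo",
});
```

Calling `requireEnv` once at the top of `index.js` turns a vague downstream failure into an actionable message listing exactly which variables are absent.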
- Implement query functionality
- Add support for multiple PDF documents
- Include progress indicators for long-running operations
- Add batch processing capabilities
- Implement caching for frequently accessed vectors
ISC