
PDF-based RAG Implementation

🚀 A Node.js-based RAG (Retrieval-Augmented Generation) system that converts PDFs into searchable knowledge bases. Built with LangChain and Google's Generative AI, optimized for technical content like DSA documentation.


Features

  • PDF document loading and processing
  • Text chunking with customizable size and overlap
  • Embedding generation using Google's Generative AI
  • Vector storage and retrieval using Pinecone
  • Environment-based configuration

Prerequisites

  • Node.js v20 or later
  • A Google AI (Gemini) API key
  • A Pinecone account and API key
  • A PDF document to process (this implementation uses a Data Structures and Algorithms (DSA) PDF, with the configuration tuned for technical content)

Installation

  1. Clone the repository:
git clone [your-repo-url]
cd RAG
  2. Install dependencies:
npm install
  3. Create a .env file in the root directory with the following variables:
GEMINI_API_KEY=your_gemini_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_ENVIRONMENT=your_pinecone_environment
PINECONE_INDEX_NAME=your_pinecone_index_name

Project Structure

  • index.js - Main application file handling PDF processing and vector storage
  • query.js - (To be implemented) Query interface for the RAG system
  • dsa.pdf - Sample PDF document for processing

Configuration

The project uses the following configuration parameters:

  • Chunk Size: 1000 characters
  • Chunk Overlap: 200 characters
  • Text Embedding Model: 'text-embedding-004'
  • Max Concurrency for Pinecone uploads: 5
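To illustrate how the chunk size and overlap parameters interact, here is a minimal plain-JavaScript sketch of overlapping character chunking. Note this is a hypothetical helper for illustration only; the actual RecursiveCharacterTextSplitter also splits on separators such as paragraph and sentence boundaries.

```javascript
// Hypothetical illustration of chunk size/overlap; not the actual
// RecursiveCharacterTextSplitter, which also respects text separators.
function chunkText(text, chunkSize = 1000, chunkOverlap = 200) {
  const chunks = [];
  // Each new chunk starts (chunkSize - chunkOverlap) characters after
  // the previous one, so consecutive chunks share `chunkOverlap` characters.
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// With the project's defaults, a 2,000-character document produces
// chunks starting at offsets 0, 800, and 1600.
const demo = "x".repeat(2000);
console.log(chunkText(demo).length); // 3
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval quality at the cost of some duplicated storage.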

Dependencies

  • @langchain/google-genai: For generating embeddings
  • @langchain/textsplitters: For splitting text into chunks
  • @langchain/community: For PDF loading capabilities
  • @langchain/pinecone: For vector store operations
  • @pinecone-database/pinecone: For Pinecone database interactions
  • dotenv: For environment variable management
  • pdf-parse: For PDF parsing

Implementation Details

  1. PDF Loading: The system uses LangChain's PDFLoader to load and parse PDF documents.

  2. Text Chunking: Documents are split into smaller chunks using RecursiveCharacterTextSplitter with configurable chunk size and overlap.

  3. Embedding Generation: Each chunk is converted into embeddings using Google's Generative AI model.

  4. Vector Storage: The embeddings are stored in a Pinecone vector database for efficient similarity search.
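The similarity search in step 4 boils down to comparing embedding vectors, typically by cosine similarity. The following self-contained sketch uses toy 3-dimensional vectors (real text-embedding-004 embeddings have many more dimensions) to show the idea behind the nearest-neighbor lookup that Pinecone performs at scale:

```javascript
// Toy illustration of cosine-similarity search; the vectors and chunk
// texts here are made up, and a real vector database uses approximate
// nearest-neighbor indexes rather than a linear scan.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the stored chunk whose embedding is closest to the query's.
function nearest(queryVec, store) {
  return store.reduce((best, entry) =>
    cosineSimilarity(queryVec, entry.vector) >
    cosineSimilarity(queryVec, best.vector) ? entry : best);
}

const store = [
  { text: "binary search trees", vector: [0.9, 0.1, 0.0] },
  { text: "dynamic programming", vector: [0.1, 0.9, 0.2] },
];
console.log(nearest([0.8, 0.2, 0.1], store).text); // "binary search trees"
```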

Usage

  1. Place your PDF document in the project directory and update the PDF_PATH in index.js if needed.

  2. Run the indexing process:

node index.js
  3. (To be implemented) Run queries against your indexed document:
node query.js "your question here"
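Since query.js is not yet implemented, the following is only a hypothetical outline of its retrieve-then-generate flow: fetch the most relevant chunks, then assemble a grounded prompt for the LLM. The retrieval and generation calls are deliberately left out; only the prompt assembly is sketched.

```javascript
// Hypothetical sketch of the "augmented generation" step in query.js:
// combine retrieved chunks with the user's question into one prompt.
// The chunk texts below are invented examples.
function buildPrompt(question, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[${i + 1}] ${chunk}`)
    .join("\n");
  return `Answer the question using only the context below.\n\n` +
         `Context:\n${context}\n\nQuestion: ${question}`;
}

const prompt = buildPrompt("What is a binary search tree?", [
  "A binary search tree keeps keys in sorted order.",
  "Search, insert, and delete run in O(h) time, where h is the tree height.",
]);
console.log(prompt.includes("Question: What is a binary search tree?")); // true
```

Numbering the context chunks makes it easy to ask the model to cite which passage supports its answer.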

Error Handling

The system includes basic error handling for:

  • PDF loading failures
  • Text processing errors
  • API connection issues
  • Missing environment variables
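One simple way to surface missing environment variables early is to validate them at startup, before any API calls are made. This is a hypothetical sketch (the actual index.js may handle this differently), using variable names from the Installation section:

```javascript
// Hypothetical startup check: fail fast with a clear message if any
// required environment variable is absent.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}

// Passes silently when every variable is present.
requireEnv(
  ["GEMINI_API_KEY", "PINECONE_INDEX_NAME"],
  { GEMINI_API_KEY: "demo-key", PINECONE_INDEX_NAME: "demo-index" }
);
```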

Future Improvements

  • Implement query functionality
  • Add support for multiple PDF documents
  • Include progress indicators for long-running operations
  • Add batch processing capabilities
  • Implement caching for frequently accessed vectors

License

ISC

You may refer to this blog for a walkthrough of RAG implementation: https://certain-mechanic-42c.notion.site/RAG-System-23c3a78e0e22801caa04d16f95df1825
