🚀 A Node.js-based RAG (Retrieval-Augmented Generation) system that converts PDFs into searchable knowledge bases. Built with LangChain and Google's Generative AI, optimized for technical content like DSA documentation.
- PDF document loading and processing
- Text chunking with customizable size and overlap
- Embedding generation using Google's Generative AI
- Vector storage and retrieval using Pinecone
- Environment-based configuration
- Node.js v20 or later
- A Google AI (Gemini) API key
- A Pinecone account and API key
- A PDF document to process: this implementation uses a Data Structures and Algorithms (DSA) PDF, with the configuration tuned for technical content.
- Clone the repository:

```bash
git clone [your-repo-url]
cd RAG
```

- Install dependencies:

```bash
npm install
```

- Create a `.env` file in the root directory with the following variables:

```
GEMINI_API_KEY=your_gemini_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_ENVIRONMENT=your_pinecone_env
PINECONE_INDEX_NAME=your_pinecone_index_name
```

Project files:

- `index.js` – Main application file handling PDF processing and vector storage
- `query.js` – (To be implemented) Query interface for the RAG system
- `dsa.pdf` – Sample PDF document for processing
The project uses the following configuration parameters:
- Chunk Size: 1000 characters
- Chunk Overlap: 200 characters
- Text Embedding Model: 'text-embedding-004'
- Max Concurrency for Pinecone uploads: 5
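To illustrate how the chunk size and overlap parameters interact, here is a plain-JS sketch of sliding-window chunking. This is a simplification: the actual `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and sentences rather than at fixed character offsets.

```javascript
// Plain-JS illustration of fixed-size chunking with overlap.
// With chunkSize = 1000 and overlap = 200, each new chunk starts
// 800 characters after the previous one, so neighboring chunks
// share 200 characters of context.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // 800 with the defaults above
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const chunks = chunkText("x".repeat(2500));
console.log(chunks.length);    // 3
console.log(chunks[0].length); // 1000
console.log(chunks[2].length); // 900 (final, shorter chunk)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.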
- `@langchain/google-genai`: For generating embeddings
- `@langchain/textsplitters`: For splitting text into chunks
- `@langchain/community`: For PDF loading capabilities
- `@langchain/pinecone`: For vector store operations
- `@pinecone-database/pinecone`: For Pinecone database interactions
- `dotenv`: For environment variable management
- `pdf-parse`: For PDF parsing
1. PDF Loading: The system uses LangChain's `PDFLoader` to load and parse PDF documents.
2. Text Chunking: Documents are split into smaller chunks using `RecursiveCharacterTextSplitter` with configurable chunk size and overlap.
3. Embedding Generation: Each chunk is converted into an embedding using Google's Generative AI model.
4. Vector Storage: The embeddings are stored in a Pinecone vector database for efficient similarity search.
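The four steps above can be sketched roughly as follows using the packages listed under dependencies. This is an assumption-laden sketch, not the exact contents of `index.js`: the `PDF_PATH` value, client setup, and error handling in the real file may differ.

```javascript
// Sketch of the indexing pipeline (assumed wiring; see index.js for
// the authoritative version). Requires GEMINI_API_KEY, PINECONE_API_KEY,
// and PINECONE_INDEX_NAME in the environment.
import "dotenv/config";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";

const PDF_PATH = "./dsa.pdf";

// 1. PDF Loading
const docs = await new PDFLoader(PDF_PATH).load();

// 2. Text Chunking: 1000-char chunks with 200-char overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Embedding Generation with Google's text-embedding-004 model
const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.GEMINI_API_KEY,
  model: "text-embedding-004",
});

// 4. Vector Storage: embed each chunk and upsert into Pinecone,
// with at most 5 concurrent upload batches
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
const pineconeIndex = pinecone.index(process.env.PINECONE_INDEX_NAME);
await PineconeStore.fromDocuments(chunks, embeddings, {
  pineconeIndex,
  maxConcurrency: 5,
});
```

`PineconeStore.fromDocuments` handles both the embedding calls and the upserts, which is why steps 3 and 4 collapse into a single call at the end.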
1. Place your PDF document in the project directory and update `PDF_PATH` in `index.js` if needed.
2. Run the indexing process:

```bash
node index.js
```

3. (To be implemented) Run queries against your indexed document:

```bash
node query.js "your question here"
```

The system includes basic error handling for:
- PDF loading failures
- Text processing errors
- API connection issues
- Missing environment variables
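A common way to handle the last case is to validate required variables at startup, before any API clients are constructed. The helper below is a hypothetical sketch (not taken from `index.js`):

```javascript
// Fail fast with a clear message if required configuration is missing,
// rather than letting an API client fail later with a cryptic error.
function requireEnv(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}

// Example: passes when both variables are present.
requireEnv(["GEMINI_API_KEY", "PINECONE_INDEX_NAME"], {
  GEMINI_API_KEY: "demo",
  PINECONE_INDEX_NAME: "demo",
});
```

Calling `requireEnv` once at the top of `index.js` turns a vague downstream failure into an actionable message listing exactly which variables are absent.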
- Implement query functionality
- Add support for multiple PDF documents
- Include progress indicators for long-running operations
- Add batch processing capabilities
- Implement caching for frequently accessed vectors
ISC