Welcome to PDF-RAG, a simple pipeline that lets you upload and interact with your PDFs. This repository provides an easy-to-use framework for building a conversational interface for document interaction.
- Set up your environment:
python3 -m venv my_new_env
source my_new_env/bin/activate
- Follow the notebook: open rag_demo.ipynb for step-by-step instructions. A test set of questions, answers, and source documents is included in cbo_questions.xlsx for evaluating the pipeline, and sample Congressional Budget Office documents are provided in the cbo_documents folder for testing. Before running, paste your Hugging Face token into the keys.txt file so the models can be accessed.
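As a sketch of the keys.txt step, the snippet below reads the token and exposes it through the HF_TOKEN environment variable, which huggingface_hub picks up automatically. It assumes keys.txt contains nothing but the token on a single line, which may not match the repository's exact format.

```python
import os
from pathlib import Path

def load_hf_token(path: str = "keys.txt") -> str:
    """Read a Hugging Face token from a file and export it as HF_TOKEN.

    Assumes the file holds the token on a single line (an assumption;
    check the repository's keys.txt format).
    """
    token = Path(path).read_text().strip()
    os.environ["HF_TOKEN"] = token  # huggingface_hub reads this variable
    return token

# Demonstration only: write a dummy token file, then load it.
Path("keys.txt").write_text("hf_dummy_token\n")
print(load_hf_token())  # -> hf_dummy_token
```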
- Processing PDFs: documents are chunked into manageable pieces using LangChain.
- Embeddings:
- We tested FinLang (for financial documents) and sentence-t5-base (for general use).
- Embeddings are managed using Faiss, which is optimized for fast similarity searches.
- Generation: Llama-2-7b-chat powers the conversational interface. We load a quantized version of the model via bitsandbytes, so a single decent GPU can handle it.
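To make the chunk-embed-search mechanics above concrete, here is a toy sketch in plain NumPy. It is not the pipeline's actual code: the real pipeline uses LangChain's text splitters, sentence-t5-base or FinLang embeddings, and Faiss for fast similarity search; this replaces them with a sliding-window chunker, a bag-of-words "embedding", and brute-force cosine search just to show the flow.

```python
import numpy as np

def chunk_text(text, chunk_size=200, overlap=50):
    """Sliding-window character chunking (LangChain's splitters are a
    smarter version of this idea)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts, vocab):
    """Toy bag-of-words vectors; the real pipeline uses sentence-t5-base
    or FinLang sentence embeddings instead."""
    mat = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for j, w in enumerate(vocab):
            mat[i, j] = t.lower().count(w)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.where(norms == 0, 1, norms)  # unit vectors -> dot == cosine

def search(query_vec, chunk_vecs, k=1):
    """Brute-force cosine search; Faiss does the same thing, but fast."""
    scores = chunk_vecs @ query_vec
    return np.argsort(-scores)[:k]

print(len(chunk_text("a" * 500)))  # -> 3 overlapping chunks

chunks = ["The deficit grew in 2023.", "Cats sleep most of the day."]
vocab = ["deficit", "cats", "2023", "sleep"]
vecs = embed(chunks, vocab)
q = embed(["What happened to the deficit?"], vocab)[0]
print(chunks[search(q, vecs)[0]])  # -> The deficit grew in 2023.
```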
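On the generation side, the retrieved chunks have to be stuffed into the model's prompt. The sketch below (function name and system text are our own, and the notebook's actual prompt may differ) follows the standard Llama-2 chat template with its [INST] ... [/INST] markers.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a Llama-2-chat style prompt that grounds the answer in the
    retrieved chunks. Illustrative only; the notebook's template may differ."""
    context = "\n\n".join(retrieved_chunks)
    system = ("You are a helpful assistant. Answer using only the context "
              "below; say so if the answer is not in the context.")
    return (f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
            f"Context:\n{context}\n\nQuestion: {question} [/INST]")

prompt = build_rag_prompt("What grew in 2023?", ["The deficit grew in 2023."])
print(prompt.startswith("<s>[INST]"))  # -> True
```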
Interact with your PDFs using the following commands. You can also try the pipeline on Google Colab.
- Streamlit:
streamlit run streamlit_UI.py
- Gradio:
python gradio_UI.py
The pipeline works well for text-heavy documents but is not yet a good fit for documents with complex multi-modal content. We have since addressed this with a multi-modal RAG pipeline powered by vision LLMs: see MultiModel-RAG-ColPali-Qdrant-Qwen for all the materials. We'd love to hear your feedback or questions. Cheers!
Erdi: [email protected]
Furkan: [email protected]