Ask questions about your PDF, DOCX, or TXT files using open-source tools, all in Google Colab.
This project lets you upload a document and ask natural-language questions about it. Behind the scenes, it uses:
- 🧩 LangChain – to handle document loading, text splitting, and chaining logic
- 🔍 FAISS – to store and retrieve semantically similar chunks
- 🤗 HuggingFace Transformers – to run a local language model for question-answering
- ⚡ Google Colab – no setup required, runs in the cloud
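
To make the retrieval step concrete, here is a minimal sketch of top-k similarity search in plain Python. It is not how FAISS or sentence-transformers work internally: word-count vectors stand in for learned embeddings, and a linear scan stands in for the FAISS index. The names `embed`, `cosine`, and `top_k` are hypothetical helpers invented for this illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (an assumption, stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "a healthy diet protects against malnutrition",
    "colab notebooks run in the cloud",
    "a balanced diet lowers the risk of heart disease",
]
best = top_k("why is a healthy diet important", chunks, k=2)
print(best)
```

A real embedding model would also match chunks that share meaning but no words with the query, which this word-count stand-in cannot do.
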
✅ Upload PDF, DOCX, or TXT
✅ Extract and chunk document contents
✅ Embed chunks using sentence-transformers
✅ Store & search in a FAISS vector index
✅ Use a small, efficient language model (Flan-T5) for answers
✅ Interactive Q&A loop
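
The chunking step can be pictured as a sliding window over the text. The sketch below is a simplification: the RecursiveCharacterTextSplitter used in the notebook also tries to break on paragraph and sentence boundaries, which this version ignores. `chunk_text` is a hypothetical helper, and its parameter names simply mirror the splitter settings used in the notebook.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # → [500, 500, 400]
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.
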
!pip install langchain langchain-community
!pip install faiss-cpu
!pip install pypdf python-docx
!pip install sentence-transformers
!pip install transformers

from google.colab import files
uploaded = files.upload()
file_path = list(uploaded.keys())[0]
print("Uploaded:", file_path)

from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Choose loader based on file type
if file_path.lower().endswith(".pdf"):
    loader = PyPDFLoader(file_path)
elif file_path.lower().endswith(".docx"):
    loader = Docx2txtLoader(file_path)
else:
    loader = TextLoader(file_path)
docs = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = splitter.split_documents(docs)
print(f"Total Chunks: {len(documents)}")

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
flan_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",  # Swap to flan-t5-small for faster inference
    max_length=512
)
llm = HuggingFacePipeline(pipeline=flan_pipeline)

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    chain_type="stuff"
)

query = "Give me a short summary of the document"
print(qa.run(query))

while True:
    q = input("Ask a question (or 'exit'): ")
    if q.strip().lower() == "exit":
        break
    print("Answer:", qa.run(q))

Ask a question (or 'exit'): what is the importance of a healthy diet?
Answer: helps to protect against malnutrition in all its forms, as well as noncommunicable diseases (NCDs) such as diabetes, heart disease, stroke and cancer
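
An answer like the one above comes from the "stuff" strategy: the retrieved chunks are concatenated into one prompt together with the question, and the model sees everything in a single call. The sketch below shows the idea only. The prompt wording, the helper name `build_stuff_prompt`, and the two sample chunks are assumptions made for illustration, not LangChain's actual template or real retrieval output.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks that paraphrase the sample answer (illustrative only).
retrieved = [
    "A healthy diet helps to protect against malnutrition in all its forms.",
    "It also protects against noncommunicable diseases such as diabetes.",
]
prompt = build_stuff_prompt("What is the importance of a healthy diet?", retrieved)
print(prompt)
```

Because everything goes into one call, the number of retrieved chunks and the chunk size together determine how long the prompt gets, which is worth keeping in mind with a small model.
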
- .pdf – parsed using PyPDFLoader
- .docx – parsed using Docx2txtLoader
- .txt – parsed using TextLoader
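
The mapping above can also be written as a lookup table keyed by file extension. This sketch returns loader class names as strings so that it runs without LangChain installed. `LOADER_BY_EXTENSION` and `loader_name_for` are hypothetical names, and falling back to TextLoader for unknown extensions is an assumption that mirrors the notebook's `else` branch.

```python
from pathlib import Path

# Extension -> loader class name (strings only, so nothing needs to be imported).
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".txt": "TextLoader",
}


def loader_name_for(file_path):
    """Pick a loader name from the file extension, ignoring case."""
    suffix = Path(file_path).suffix.lower()
    # Assumption: anything unrecognized is treated as plain text.
    return LOADER_BY_EXTENSION.get(suffix, "TextLoader")


print(loader_name_for("Report.PDF"))  # → PyPDFLoader
```

Lowercasing the suffix matters in practice, since uploaded files often arrive with names like `Report.PDF`.
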