Semantic Search Engine is an intelligent PDF-based search tool that allows you to ask natural language questions and get semantically relevant answers from documents.
It uses Sentence Transformers for embeddings and FAISS for vector similarity search.
I didn't use LangChain. Its abstractions would have made this a lot simpler and more efficient, but I didn't want it to be simple, so I went the traditional way.
- PDF Reader – Upload and parse text from PDF files.
- Text Chunking – Splits large documents into smaller sentence chunks for better embedding and retrieval.
- Embeddings – Uses `all-MiniLM-L6-v2` from Sentence Transformers to generate semantic embeddings.
- Vector Search – Leverages FAISS to store and query embeddings efficiently.
- Question Answering – Ask a question like "What is the relevance of Blockchain?" and retrieve the most relevant chunks.
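To make the Vector Search and Question Answering features concrete, here is a minimal sketch of what a nearest-neighbour lookup computes. It is a plain NumPy stand-in, not the FAISS code this project uses: the function name `l2_search`, the toy 2-D vectors, and the chunk labels are all hypothetical, and real embeddings from `all-MiniLM-L6-v2` have far more dimensions. The point is only to show why a search yields one array of distances and one array of chunk indices per query.

```python
import numpy as np


def l2_search(index_vectors, query_vectors, k):
    """Return (distances, indices) of the k nearest stored vectors per query.

    Hypothetical helper: a brute-force NumPy illustration, not a FAISS API.
    """
    # Squared L2 distance between every query and every stored vector.
    diff = query_vectors[:, None, :] - index_vectors[None, :, :]
    dists = np.sum(diff ** 2, axis=2)
    # Sort each row in ascending order and keep the k closest entries.
    idx = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx


# Toy 2-D "embeddings" for four chunks (made-up data for illustration).
chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
vectors = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32"
)
query = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = l2_search(vectors, query, k=2)
for dist, i in zip(distances[0], indices[0]):
    print([chunks[i]], "distance:", round(float(dist), 2))
```

Both returned arrays have shape `(number of queries, k)`, and a smaller distance means a closer match. A flat L2 index in FAISS performs the same exhaustive comparison and also reports squared distances, just much faster on large collections. Which index type `main.py` actually builds is not stated here.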
Run the script:

    python main.py

Sample output:

    Total chunks: 42
    2
    [[0.23 0.45]] [[12 33]]
    ['Blockchain enables secure and transparent transactions ...'] distance: 0.23
    ['Distributed ledgers provide ...'] distance: 0.45

- Extract text from PDFs using PyPDF2.
- Tokenize sentences using NLTK.
- Split into chunks (approx. 100 words each).
- Encode sentences into embeddings using Sentence Transformers.
- Build a FAISS index to store embeddings.
- Search queries against the index to find relevant chunks.
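The chunking step in the list above can be sketched as follows. This is an assumption about how the grouping might work, not the code in `main.py`: the function name `chunk_sentences` and the `max_words` parameter are hypothetical, and the input is a plain list of sentences such as the one NLTK's `sent_tokenize` would produce. Sentences are kept whole, so chunks come out at roughly the target size, never exactly.

```python
def chunk_sentences(sentences, max_words=100):
    """Group consecutive sentences into chunks of about max_words words.

    Hypothetical helper for illustration; sentences are never split.
    """
    chunks = []
    current = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    # Flush whatever is left after the last sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny budget so the grouping is visible on three short sentences.
sentences = [
    "Blockchain enables secure transactions.",
    "Distributed ledgers record every change.",
    "Each block links to the previous one.",
]
chunks = chunk_sentences(sentences, max_words=10)
print(f"Total chunks: {len(chunks)}")
```

A single sentence longer than `max_words` still becomes its own chunk instead of being cut, which keeps each embedding tied to complete sentences at the cost of an occasional oversized chunk.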
## 📦 Prerequisites
- Python 3.9+ (tested on 3.11 / 3.13)
- Virtual environment (venv) recommended

Install dependencies:

    pip install PyPDF2 nltk sentence-transformers faiss-cpu numpy