Hello everyone! This repo builds a search engine on top of the links obtained through the web crawler. The only information needed to construct the search engine is:
- URLs
- HTML of each URL
This search engine scores URLs against the user's search query using a classic formula called tf-idf, which stands for term frequency-inverse document frequency. In this case, each URL = document.
What this means is that for a given query, each URL accumulates a score for each word in the query. A document that uses a query term many times gets a higher score (term frequency), and the fewer documents that contain the term, the higher the score for documents that DO contain it (inverse document frequency).
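To make that concrete, here is a minimal sketch of the standard tf-idf formula in Python. The function and variable names are illustrative, not taken from this repo, and real implementations often use log-scaled tf or smoothed idf:

```python
import math

def tf_idf(term_freq, docs_with_term, total_docs):
    """tf-idf score for one term in one document.

    term_freq      -- times the term appears in this document
    docs_with_term -- number of documents containing the term
    total_docs     -- number of documents in the whole collection
    """
    tf = term_freq
    # Rarer terms give a bigger idf; only called for terms that
    # actually appear in the index, so docs_with_term >= 1.
    idf = math.log(total_docs / docs_with_term)
    return tf * idf
```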
To make the calculation of tf-idf faster, we must construct an Inverted Index. This Inverted Index is implemented as a hash table: each term is a key, and its value is another hash table whose keys are document IDs and whose values are the term's frequency in that document. For example:
```python
{term1: {doc1: 5, doc2: 7, ...}, term2: {doc1: 3, doc3: 4, ...}}
```
The outer dictionary is the Inverted Index, and each inner dictionary is called a Postings List. Having these two components allows for quick computation of tf-idf scores.
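As a sketch of how such an index can be built (the names `build_inverted_index` and `documents` are my own, not the repo's):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: {doc_id: list of terms}. Returns {term: {doc_id: term_freq}}."""
    index = defaultdict(dict)
    for doc_id, terms in documents.items():
        for term in terms:
            postings = index[term]  # the Postings List for this term
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return dict(index)

index = build_inverted_index({"doc1": ["cat", "dog", "cat"], "doc2": ["dog"]})
# index["cat"] == {"doc1": 2} -- at query time we only touch the
# documents that actually contain the term, which is what makes
# tf-idf computation fast.
```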
After completing the calculation of the raw, unmodified tf-idf scores, I apply a weight modifier based on where each term was found. For instance, if the search term was found inside a header tag, I give it more weight than a term found in the body tag.
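The repo's actual multipliers aren't listed here, so the weights below are made-up placeholders; this is just a sketch of the idea:

```python
# Hypothetical weights -- the actual values used in this repo may differ.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "body": 1.0}

def weighted_term_score(raw_tf_idf, tag):
    """Boost a term's tf-idf score based on the HTML tag it appeared in."""
    return raw_tf_idf * TAG_WEIGHTS.get(tag, 1.0)
```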
Since storing this Inverted Index in raw form can take up a huge amount of space (originally, my index was ~100MB), I compress it in two ways: the terms are written to a file separated by underscores, and each postings list is delta-encoded, storing the gap between consecutive document IDs instead of each full ID. This saves the most space for terms found across many documents.
For instance, if a list has two docIDs, 100000 and 100001, we would instead store them as: 100000, 1.
If we were storing this in a .txt file, we just saved 5 bytes on a single postings entry.
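A minimal sketch of this delta (gap) encoding, assuming docIDs are kept sorted (the function names are illustrative):

```python
def delta_encode(doc_ids):
    """Store each docID as the gap from the previous one (IDs must be sorted)."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas

def delta_decode(deltas):
    """Rebuild the original docIDs by accumulating the gaps."""
    doc_ids, running = [], 0
    for gap in deltas:
        running += gap
        doc_ids.append(running)
    return doc_ids

# delta_encode([100000, 100001]) -> [100000, 1]
# delta_decode([100000, 1])      -> [100000, 100001]
```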
After implementing simple versions of these two compressions, the total size of the inverted index is ~45MB: more than half the space saved compared to just using json.dumps!
A future improvement is to move the index into a database (such as MySQL or MongoDB).