This is an application to search a term across multiple documents. Three approaches are implemented to provide the search capability with variations to improve performance and enable meaningful word search. The searches ignore letter cases with the thought process that its not required for an e-commerce site's search functionality.
The application services are accessible by rest enabled end-points for easy plug in to the UI client code.
The most simplistic approach of all where every document is visited to find the number of occurrences. This search returns a result where the query term may be a substring of a larger word.
This search is not optimal as every time a user does a search, all documents are traversed to return the relevant ones. Here relevance is defined by the frequency of the word in the documents that contain it.
A LRU cache is included as part of the implementation to speed up responses for frequently searched or recently used terms.
This feature allows a user to retrieve documents with exact term matches along with other related matches. Currently only the count of the related match is returned as that is the expected behavior defined in the case study. But with simple tweaks even that can be displayed in the output.
This functionality can also accept a user defined regular expression and return count of the matches based on it.
This search mechanism precomputes the inverted document map which also contains the term frequency. These pre-computations are held in-memory after the first time its created and then subsequent searches are looked up in it.
Follow the following steps to run locally on your system:
- Install python3 (preferably)
- Clone the repo to a location in your system
- Start a terminal from inside the cloned repo directory and install dependencies by running
pip3 install --user -r requirements.txt
- Execute following command to interact with the terminal interface
python3 search.py
- (Optional) Start the flask server by running
python3 server.py
- Open a browser and the endpoints will render the search results. The three search capabilities are exposed by the following endpoints
localhost:8080/api/stringsearch/<place your search term here>
localhost:8080/api/regexsearch/<place your search term here>
localhost:8080/api/indexsearch/<place your search term here>
Follow the following steps to run locally on your system:
- Configure and start an EC2 instance on AWS.
- Configure the associated Security Group for the client by adding an inbound rule to include custom TCP traffic for port 8080 with source as 'Anywhere'.
- Follow the local system setup. The flask server should be started within a TMUX session so that it operates even after the ssh session dies.
The three different approaches were stress tested with 2 million random words. The execution time was noted to guage the rate of search result retrieval.
- string_search performance: 1303.7215265 s
- regex_search performance: 4704.5470226 s
- index_search performance: 5.030461999999716 s
The indexed search beats the other searches because of the quick lookup from the dictionary that map words to the list of documents containing it.
The test suite for the project can be run by executing the following command:
python3 -m unittest
- For a real-time scenario it would help to store the inverted index in a persistent storage like database
- Large requests in the order of more than 5000 requests per second can be catered by -
- Allocating more RAM that could support a larger cache for frequently searched terms
- Running separate thread to update the inverted index as an when new documents arrive in the datastore
- Compute the inverted index by running parallel spark jobs on data distributed across multiple nodes
- Probabilistically load the next set of documents based on current term to reduce subsequent search time
- Gauge network traffic to add and remove more CPUs on the fly to deal with spikes and ebbs