GitHub - amartyabasu/target-case-study

Table of contents

Design Details
Running the App
- Local execution
- AWS Deployment
Performance Analysis
Unit Test
Future Enhancements

Design Details

This application searches for a term across multiple documents. It implements three search approaches, with variations that improve performance and enable meaningful word search. All searches are case-insensitive, on the assumption that case sensitivity is not required for an e-commerce site's search functionality.

The search services are exposed through REST endpoints so that UI client code can plug into them easily.

String Search

This is the simplest approach: every document is scanned to count the occurrences of the query term. Because it matches substrings, a hit may come from a larger word that merely contains the term.

This search is not optimal because every query traverses all documents to find the relevant ones. Relevance is defined here as the frequency of the term in each document that contains it.

An LRU cache is included in the implementation to speed up responses for frequently or recently searched terms.

Regex Search

This search retrieves documents with exact term matches along with other related matches. Currently only the count of related matches is returned, as that is the expected behavior defined in the case study; with small changes the related matches themselves could also be displayed.

It can also accept a user-defined regular expression and return the count of matches for it.
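The sketch below shows one way the exact and related counts could be computed. It makes several assumptions that the text does not confirm: a "related" match is taken to mean a longer word containing the term, the term is a single word, and the names `regex_search` and `pattern_search` are hypothetical.

```python
import re

# Hypothetical sample corpus used only for illustration.
DOCUMENTS = {
    "doc1.txt": "The cat sat. Catalog of cats.",
    "doc2.txt": "A dog barked at the cat.",
    "doc3.txt": "Nothing relevant here.",
}


def regex_search(term, documents):
    """Count whole-word (exact) matches and longer words containing the term (related)."""
    escaped = re.escape(term)  # treat the term literally, not as a pattern
    exact = re.compile(rf"\b{escaped}\b", re.IGNORECASE)
    containing = re.compile(rf"\w*{escaped}\w*", re.IGNORECASE)
    results = {}
    for name, text in documents.items():
        exact_count = len(exact.findall(text))
        # Words that contain the term, minus those that are exactly the term.
        related_count = len(containing.findall(text)) - exact_count
        if exact_count or related_count:
            results[name] = {"exact": exact_count, "related": related_count}
    return results


def pattern_search(pattern, documents):
    """Count matches of a user-supplied regular expression in each document."""
    compiled = re.compile(pattern, re.IGNORECASE)
    counts = {name: len(compiled.findall(text)) for name, text in documents.items()}
    return {name: count for name, count in counts.items() if count}
```

Escaping the term with `re.escape` keeps characters such as `.` or `+` from being read as regex syntax, while `pattern_search` deliberately passes the user's pattern through unchanged.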

Index Search

This search precomputes an inverted index that maps each term to the documents containing it, together with the term frequency. The index is held in memory once it has been created, and subsequent searches are looked up in it.
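A minimal sketch of such an inverted index follows. The class name `InvertedIndex`, the `\w+` tokenization, and the lazy build on the first query are assumptions made for illustration, not details taken from the repository.

```python
import re
from collections import Counter


class InvertedIndex:
    """Maps each lowercased word to {document name: term frequency}."""

    def __init__(self, documents):
        self._documents = documents
        self._index = None  # built on the first search, then reused

    def _build(self):
        index = {}
        for name, text in self._documents.items():
            # Count every word in the document once, up front.
            frequencies = Counter(re.findall(r"\w+", text.lower()))
            for word, freq in frequencies.items():
                index.setdefault(word, {})[name] = freq
        return index

    def search(self, term):
        """Return (document, frequency) pairs, most frequent first."""
        if self._index is None:
            self._index = self._build()
        postings = self._index.get(term.lower(), {})
        return sorted(postings.items(), key=lambda pair: (-pair[1], pair[0]))
```

For example, indexing `{"a.txt": "red shoes and red socks", "b.txt": "Red hat"}` and calling `search("red")` ranks `a.txt` ahead of `b.txt`. Unlike the substring scan, this lookup only matches whole words, since the index is keyed by tokens.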

Running the App

Local execution

Follow these steps to run the app locally:

Install Python (preferably Python 3).
Clone the repo to a location on your system.
Start a terminal inside the cloned repo directory and install the dependencies by running

pip3 install --user -r requirements.txt

Execute the following command to interact with the terminal interface

python3 search.py

(Optional) Start the Flask server by running

python3 server.py

Open a browser and visit one of the endpoints to see the search results. The three search capabilities are exposed by the following endpoints:

localhost:8080/api/stringsearch/<place your search term here>
localhost:8080/api/regexsearch/<place your search term here>
localhost:8080/api/indexsearch/<place your search term here>
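The endpoints can also be called from a script. The sketch below assumes the server is running locally on port 8080 as described above; `search_url` and `fetch_results` are hypothetical helper names, and the response body is returned as raw text because its format is not specified here.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8080/api"
SEARCH_KINDS = {"stringsearch", "regexsearch", "indexsearch"}


def search_url(kind, term):
    """Build the endpoint URL, percent-encoding the term so spaces and slashes are safe."""
    if kind not in SEARCH_KINDS:
        raise ValueError(f"unknown search kind: {kind}")
    return f"{BASE_URL}/{kind}/{quote(term, safe='')}"


def fetch_results(kind, term, timeout=5):
    """Call the endpoint and return the response body as text (requires a running server)."""
    with urlopen(search_url(kind, term), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `search_url("indexsearch", "red shoes")` yields `http://localhost:8080/api/indexsearch/red%20shoes`.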

AWS Deployment

Follow these steps to deploy the app on AWS:

Configure and start an EC2 instance on AWS.
Configure the associated Security Group by adding an inbound rule that allows custom TCP traffic on port 8080 with source 'Anywhere'.
Follow the local setup steps on the instance. Start the Flask server inside a tmux session so that it keeps running after the SSH session ends.

Performance Analysis

The three approaches were stress tested with 2 million random words. The execution time was recorded to gauge the rate of search result retrieval.

string_search performance: 1303.7215265 s
regex_search performance: 4704.5470226 s
index_search performance: 5.030461999999716 s

The indexed search beats the other two because of the quick lookup in a dictionary that maps each word to the list of documents containing it.
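A timing harness along these lines could reproduce this kind of comparison on a smaller scale. This is a sketch rather than the script that produced the figures above: `random_words` and `time_queries` are hypothetical helpers, and the word length and seed are arbitrary choices.

```python
import random
import string
import time


def random_words(n, length=5, seed=0):
    """Generate n pseudo-random lowercase words (seeded for repeatable runs)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase, k=length)) for _ in range(n)]


def time_queries(search_fn, queries):
    """Return the wall-clock seconds taken to run search_fn over every query."""
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    return time.perf_counter() - start
```

Passing each search function in turn, for example `time_queries(my_search, random_words(10_000))` where `my_search` stands for any of the three searches, gives comparable numbers as long as every function receives the same word list.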

Unit Test

The test suite for the project can be run by executing the following command:

python3 -m unittest

Future Enhancements

For a real-time scenario it would help to store the inverted index in persistent storage such as a database.
Heavy loads of more than 5000 requests per second can be handled by:
1. Allocating more RAM to support a larger cache for frequently searched terms
2. Running a separate thread to update the inverted index as and when new documents arrive in the datastore
3. Computing the inverted index with parallel Spark jobs on data distributed across multiple nodes
4. Probabilistically loading the next set of documents based on the current term to reduce subsequent search time
5. Monitoring network traffic to add and remove CPUs on the fly to deal with spikes and ebbs
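To make the persistent-storage idea concrete, here is one possible sketch using SQLite from the Python standard library. The choice of database, the `postings` table layout, and the function names `save_index` and `lookup` are all assumptions; the project does not specify any of them.

```python
import sqlite3


def save_index(conn, index):
    """Persist an inverted index shaped as {term: {document: frequency}}."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "term TEXT, doc TEXT, freq INTEGER, PRIMARY KEY (term, doc))"
    )
    rows = [(term, doc, freq) for term, docs in index.items() for doc, freq in docs.items()]
    # INSERT OR REPLACE lets a rebuilt index overwrite stale frequencies.
    conn.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()


def lookup(conn, term):
    """Return (document, frequency) rows for a term, most frequent first."""
    cursor = conn.execute(
        "SELECT doc, freq FROM postings WHERE term = ? ORDER BY freq DESC, doc",
        (term.lower(),),
    )
    return cursor.fetchall()
```

The composite primary key on `(term, doc)` doubles as the lookup index, so a query by term does not scan the whole table. Opening the connection with a file path instead of `":memory:"` keeps the index across restarts.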

