Multiple String Matching

This project uses the Aho-Corasick algorithm to find lines in a target file that match one or more strings in a dictionary file.

Objective

This project aims to achieve three objectives:

Dictionary filtering: filter the string dictionary against several reference documents so that the filtered dictionary does not contain common words.
Single matching: output the lines in the target file that match one or more words in the string dictionary.
Double matching: output the lines in the target file that match two or more words in the string dictionary.

This project uses a self-modified version of an Aho-Corasick implementation hosted on GitHub.
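For readers unfamiliar with the algorithm, the following is a minimal Python sketch of Aho-Corasick matching. It is not the implementation used by this project (which is built with CMake and Boost); the names build_automaton and find_all are hypothetical and chosen only for illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links (hypothetical helper, illustration only)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link of each node
    out = [[]]    # patterns that end at each node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass: children of the root keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The text is scanned once regardless of how many dictionary entries there are, which is why this family of algorithms suits large dictionaries.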

Prerequisite

CMake

Boost

Installation

 git clone https://github.com/suixin661014/protein_pair_extraction
 cd protein_pair_extraction
 mkdir build && cd build 
 cmake -DCMAKE_BUILD_TYPE=Release ..
 make

Usage

The build produces two executables: 1) protein_filter and 2) protein_pair. It is suggested to move both executables to the directory that contains the dictionary file (protein name dictionary) and the target file (sentences of abstracts).

./protein_filter protein_name_file [threshold] [output]

will filter the dictionary protein_name_file using all .txt files in the subdirectory filter as targets. An entry in the dictionary will be removed if it occurs more than threshold times in any of those files. The result will be written to output. By default, threshold=1 and output=stdout.
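The filtering rule can be sketched in Python as follows. This is an illustration of the rule as described above, not the code in protein_filter: it counts occurrences with a naive token scan instead of Aho-Corasick, and the function names are hypothetical. Documents are assumed to be whitespace-tokenized strings.

```python
def count_occurrences(entry, tokens):
    """Count how often the (possibly multi-word) entry appears as whole tokens."""
    words = entry.split()
    n = len(words)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

def filter_dictionary(entries, documents, threshold=1):
    """Keep entries that occur at most `threshold` times in every document."""
    tokenized = [doc.split() for doc in documents]
    kept = []
    for entry in entries:
        too_common = any(count_occurrences(entry, toks) > threshold
                         for toks in tokenized)
        if not too_common:
            kept.append(entry)
    return kept
```

With the default threshold of 1, an entry that appears exactly once in a filter document is kept; it is dropped only from the second occurrence onward.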

./protein_pair protein_name_file pubmed_abstract_file [output]

will find sentences in the target pubmed_abstract_file that contain 2+ entries in protein_name_file, and write them to output. By default, output=stdout.
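A rough Python sketch of this double-matching step is shown below. It assumes that "2+ entries" means at least two distinct dictionary entries and that each input line has the form "label \t sentence" with a space-separated sentence; the function names are hypothetical, and a naive token scan stands in for the Aho-Corasick matcher.

```python
def matched_entries(sentence, entries):
    """Return the dictionary entries that occur as whole tokens in the sentence."""
    tokens = sentence.split()
    found = []
    for entry in entries:
        words = entry.split()
        n = len(words)
        if n and any(tokens[i:i + n] == words
                     for i in range(len(tokens) - n + 1)):
            found.append(entry)
    return found

def pair_lines(lines, entries, minimum=2):
    """Yield lines whose sentence contains at least `minimum` distinct entries."""
    for line in lines:
        line = line.rstrip("\n")
        _label, _, sentence = line.partition("\t")
        if len(matched_entries(sentence, entries)) >= minimum:
            yield line
```

Calling pair_lines with minimum=1 would give the single-matching behaviour listed under Objective.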

Limitation

The current version does not implement single matching because of time constraints. The target files, e.g. pubmed_abstract_file, should use the format "label \t sentence". Each sentence should be pre-tokenized so that words and punctuation are separated by spaces; for example, "ABC, DEF." should be pre-processed into "ABC , DEF ."
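One possible way to prepare input in this format is sketched below in Python. The exact tokenization rules are not specified here, so this sketch only detaches trailing punctuation from the set ,.;:!? and leaves everything else (for example, hyphens inside a name) untouched; both function names are hypothetical.

```python
import re

# A token followed by one or more trailing punctuation characters.
_TRAILING = re.compile(r"^(.*?)([,.;:!?]+)$")

def separate_punctuation(sentence):
    """Put spaces between words and their trailing punctuation marks."""
    out = []
    for token in sentence.split():
        m = _TRAILING.match(token)
        if m and m.group(1):
            out.append(m.group(1))
            out.extend(m.group(2))  # one output token per punctuation character
        else:
            out.append(token)
    return " ".join(out)

def format_line(label, sentence):
    """Build one 'label \\t sentence' line for the target file."""
    return f"{label}\t{separate_punctuation(sentence)}"
```

Abbreviations such as "e.g." would also lose their final period under this simple rule, so a real pipeline may need a proper tokenizer.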

Known issues

The Aho-Corasick implementation stores local objects whenever there is a match, and thus suffers from performance issues when a sentence is very long. Its implementation of "only_whole_words()" is also inefficient. A workaround exists, but we will continue to use the current implementation as long as it gives acceptable performance.
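To illustrate what whole-word filtering involves, here is a small Python sketch that post-filters raw matches by checking their boundaries. It is not the library's "only_whole_words()" code; it assumes matches are given as (start_index, pattern) pairs and that words are delimited by whitespace, and the function name is hypothetical.

```python
def whole_word_hits(text, hits):
    """Keep only matches delimited by whitespace or by the ends of the text."""
    kept = []
    for start, pat in hits:
        end = start + len(pat)
        left_ok = start == 0 or text[start - 1].isspace()
        right_ok = end == len(text) or text[end].isspace()
        if left_ok and right_ok:
            kept.append((start, pat))
    return kept
```

Each match costs two constant-time boundary checks here, so the filtering step itself does not need to materialize additional objects per match.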
