The following changes were made to this version: · AdamMeyers/The_Termolator@a01ee98

The following changes were made to this version:

1) Removed stemming procedures from the distributional component and replaced
them with lemmatizations from the .terms file.
2) For distributional purposes, count alternative forms of the same lemma as instances
of the same term.
3) Removed substring procedures from the distributional process and instead
made them part of the chunking procedures (preprocessors to the
distributional system). Treat cases where prepositional phrases are
converted to prenominals specially (e.g., full term = Recognition
of Speech, lemma = Speech Recognition). In particular, limit
substring relations for these cases.
4) Fixed a bug in abbreviate that occurred when white space was made part of an abbreviation sequence.
5) Added an option for webscore to use the www.webcorp.org.uk search
engine. This is experimental. It is slower than Yahoo. However, it
may be possible to download webcorp and use it internally. In that
case, it would have some advantages over Yahoo: it would not necessarily
be slower; and it would be more resilient.
6) Removed an unnecessary print statement in filter_term_output.py
7) Started using make_final_output_file.py to produce the .out_term_list file
instead of a shell command. There is a change in output format. Each
line consists of the lemma of the term, followed by variants,
separated by tabs.
8) Added possibly_create_abbreviate_dicts.py to create missing abbreviate
dictionaries.
9) Added a way of running the program that is customized to using
single files as foreground with the same background. For example,
we are currently generating one set of terms using each Supreme Court
decision as foreground and the full set of Supreme Court decisions as
background. In future versions, we will provide an example from this
run (using the shell script run_termolator_with_1_file_foreground.sh).
10) The changes described above significantly reduce the system's dependency
on NLTK, and it is possible that a future version will make
installation of NLTK unnecessary.
11) We are now using argument 12 in the run_termolator.sh script as the prefix
for multiple shared caches of information to reuse. Previously, it was just
used as a prefix for the webscore file, but now it is also used as a prefix
for the lemma dictionary file. The current assumption is that one such file
should be used for each widely defined topic.
12) Updated the Readme to reflect these changes
13) Updated the test file directories (gutenberg-test, OANC-test and patent-test)
14) Will give this release a new release number.
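
As a rough illustration of items 2 and 7, the sketch below reads lines in the
new .out_term_list layout (lemma first, then variants, separated by tabs) and
counts every variant as an instance of its lemma. This is not the code in
make_final_output_file.py or the distributional component. The function names
parse_term_list and count_by_lemma, the sample lines, and the case-insensitive
matching are all assumptions made for the example.

```python
from collections import Counter


def parse_term_list(lines):
    """Map each variant (and the lemma itself) to its lemma.

    Assumes the layout from item 7: lemma, then variants, tab-separated.
    Lowercasing the lookup keys is an assumption of this sketch.
    """
    variant_to_lemma = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        fields = [f for f in fields if f]  # drop empty fields
        if not fields:
            continue  # skip blank lines
        lemma, variants = fields[0], fields[1:]
        variant_to_lemma[lemma.lower()] = lemma
        for variant in variants:
            variant_to_lemma[variant.lower()] = lemma
    return variant_to_lemma


def count_by_lemma(observed_terms, variant_to_lemma):
    """Count observed terms under their lemma; unknown terms are ignored."""
    counts = Counter()
    for term in observed_terms:
        lemma = variant_to_lemma.get(term.lower())
        if lemma is not None:
            counts[lemma] += 1
    return counts


# Hypothetical sample data in the tab-separated layout.
sample_lines = [
    "speech recognition\trecognition of speech\tspeech recognitions\n",
    "term extraction\tterm extractions\n",
]
mapping = parse_term_list(sample_lines)
counts = count_by_lemma(
    ["Recognition of Speech", "speech recognition", "term extractions", "unknown"],
    mapping,
)
print(counts)
```

Here "Recognition of Speech" and "speech recognition" are both tallied under
the lemma "speech recognition", which is the behavior item 2 describes.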

Adam Meyers committed Sep 20, 2017

1 parent 2ac3be8 commit a01ee98
