Commit
The following changes were made to this version:
1) Removed stemming procedures from the distributional component and replaced them with lemmatizations from the .terms file.
2) For distributional purposes, alternative forms of the same lemma now count as instances of the same term.
3) Removed substring procedures from the distributional process and made them part of the chunking procedures (preprocessors to the distributional system). Cases where prepositional phrases are converted to prenominals are treated specially (e.g., full term = Recognition of Speech, lemma = Speech Recognition). In particular, substring relations are limited for these cases.
4) Fixed a bug in abbreviate that occurred when white space was made part of an abbreviation sequence.
5) Added an option for webscore to use the www.webcorp.org.uk search engine. This is experimental and slower than Yahoo. However, it may be possible to download webcorp and use it internally. In that case, it would have some advantages over Yahoo: it would not necessarily be slower, and it would be more resilient.
6) Removed an unnecessary print statement in filter_term_output.py.
7) Started using make_final_output_file.py to produce the .out_term_list file instead of a shell command. The output format has changed: each line consists of the lemma of the term, followed by its variants, separated by tabs.
8) Added possibly_create_abbreviate_dicts.py to create missing abbreviate dictionaries.
9) Added a way of running the program that is customized to using single files as foreground with the same background. For example, we are currently generating one set of terms using each supreme court decision as foreground and the full set of supreme court decisions as background. In future versions, we will provide an example from this run (using the shell script run_termolator_with_1_file_foreground.sh).
10) The changes described above significantly reduce the system's dependency on NLTK, and it is possible that a future version will make installation of NLTK unnecessary.
11) Argument 12 of the run_termolator.sh script is now used as the prefix for multiple shared caches of information to reuse. Previously, it was used only as a prefix for the webscore file; now it is also used as the prefix for the lemma dictionary file. The current assumption is that one such file should be used for each widely defined topic.
12) Updated the Readme to reflect these changes.
13) Updated the test file directories (gutenberg-test, OANC-test and patent-test).
14) This release will be given a new release number.
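As an illustration of items 2 and 7, the sketch below reads lines in the described layout (lemma first, then variants, separated by tabs) and counts observed surface forms under their lemma. It is not taken from the project's code: the function names, the lowercasing step, and the sample lines are all assumptions made for the example, and the actual .out_term_list file may differ in details such as casing or whether the lemma is repeated among its variants.

```python
from collections import Counter


def parse_term_line(line):
    """Split one tab-separated line into (lemma, [variants])."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]


def build_variant_index(lines):
    """Map every surface form (lowercased, an assumption) to its lemma."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        lemma, variants = parse_term_line(line)
        index[lemma.lower()] = lemma
        for variant in variants:
            index[variant.lower()] = lemma
    return index


def count_by_lemma(observed_terms, index):
    """Count each observed form under its lemma; unknown forms count as themselves."""
    counts = Counter()
    for term in observed_terms:
        counts[index.get(term.lower(), term)] += 1
    return counts


# Hypothetical sample lines in the lemma<TAB>variant<TAB>variant layout.
sample_lines = [
    "speech recognition\trecognition of speech\tspeech recognitions\n",
    "neural network\tneural networks\n",
]

index = build_variant_index(sample_lines)
counts = count_by_lemma(
    ["Recognition of Speech", "speech recognition", "neural networks", "parser"],
    index,
)
```

Here "Recognition of Speech" and "speech recognition" are both tallied under the lemma "speech recognition", which is the behavior item 2 describes for distributional counts.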