From fd07bfc58ab5ce8d5ee41d5e563d3c8d42e992db Mon Sep 17 00:00:00 2001
From: Darlene Stewart
Date: Tue, 28 May 2019 02:42:06 -0400
Subject: [PATCH] Added support for subword unit view and corresponding subword unit embeddings.

Squashed merge commit of the following:

commit e9367f8f94ffd09fe76b0c12d9ace54a79533a4f
Author: Darlene Stewart
Date:   Mon May 27 18:18:33 2019 -0400

    Fixed YiSi SRL test reference files yet again.

commit a7145d50093c360642b8da68c2501c5333973ab5
Merge: ea751e4 d1fe9d8
Author: Darlene Stewart
Date:   Mon May 27 17:54:35 2019 -0400

    Merge branch 'dev.merge.NRC-private'

    Merge NRC-private commits db8a070 through f35fd82:
      git cherry-pick -e -x db8a070
      git cherry-pick -e -x db8a070..96a8a7d
      git cherry-pick -e -x f35fd82

    Also, fixed code formatting and YiSi SRL test reference files.

commit d1fe9d816c06940cd91a85beb0ca63680bae9114
Author: Darlene Stewart
Date:   Mon May 27 17:39:21 2019 -0400

    Fixed the YiSi SRL test reference files again.

commit 7f3931c178255a4a1ecac013691cbee915f3a8e6
Author: Darlene Stewart
Date:   Mon May 27 17:43:36 2019 -0400

    Code formatting fixes

commit d0022bcd5b53b5787f2ca4e8c628d345d4e2c8f7
Author: Darlene Stewart
Date:   Mon May 27 14:27:02 2019 -0400

    Fixed copyright year.

commit 748df7c0dcef368720dc7b535d1838d6019bb9ad
Author: Darlene Stewart
Date:   Thu May 23 17:14:16 2019 -0400

    Update YiSi test reference files to match output on NRC-private branch.

    (cherry picked from NRC-private branch commit f35fd82e646f1d60143b4976ff8d58c93760544b)

commit 986b61974cfc8ed08e7e06d434b8195916f457e0
Author: Darlene Stewart
Date:   Wed May 22 16:59:44 2019 -0400

    Fixed an int-size_t comparison causing a g++ warning in yisi::read_sent().

    (cherry picked from NRC-private branch commit 96a8a7d652888f05bc5fa95c60c243591cafc543)

commit 08b31bfa966af2cb4e965ecb97dfc4cf8047c7ba
Author: Darlene Stewart
Date:   Wed May 22 16:20:00 2019 -0400

    Update YiSi test reference files to match the current state of the NRC-private branch.

    (cherry picked from NRC-private branch commit 5b5f4a25e69625cd1b8a0f7a8d8974e0ce2b0e60)

commit 5f60ca583f0b127431c04b87dcab6d606ca6214b
Author: Darlene Stewart
Date:   Wed May 22 14:39:23 2019 -0400

    Fixed read_con1109 to set the sentence tokens for 'word' type.

    (cherry picked from NRC-private branch commit 2bd0c3f9e1c9029494619d2175c187a9605ca635)

commit 745004781aeabbdd1cbdfbed198de473f1706ab2
Author: Jackie Lo
Date:   Tue May 14 18:27:46 2019 -0400

    Bug fix for reading srl parse in conll09 format.

    (cherry picked from NRC-private branch commit 4b8740d2ef1c23390bcef3d0fe4a6acdcc446dfe)

commit 875171ae255dbe5a5ae7ce275e2aed5d13819373
Author: Jackie Lo
Date:   Wed May 8 07:24:43 2019 -0400

    Rewriting confusing progress message in main function.

    (cherry picked from NRC-private branch commit f29066dceba8ec5ac92b33e5142d7075f78aeafc)

commit 783ee3611d90a46503528c21b6b1b3e1efd96f69
Author: Jackie Lo
Date:   Wed May 8 07:15:35 2019 -0400

    Another bug fix in reading conll09 formatted srl.

    (cherry picked from NRC-private branch commit 675e73d35dd231d39cc415a0e7311f56052af981)

commit 7d038c0d2d49b6d96182326d3174899bfa2fef03
Author: Jackie Lo
Date:   Wed May 8 00:01:41 2019 -0400

    Bug fix in reading conll09 srl format.

    (cherry picked from NRC-private branch commit c3119343e2cceda62c37f3aed559c5f3014f7573)

commit fb6e0b69589294f9bda98efd25ee143ba6341a98
Author: Jackie Lo
Date:   Tue May 7 16:00:15 2019 -0400

    Added data structure for sentence.

    (cherry picked from NRC-private branch commit d5cdb883271ef8a30aedb5727613b57fb799821c)

commit efa1ea1014ed88910e42572720cf626423bc1713
Author: Jackie Lo
Date:   Tue May 7 15:59:09 2019 -0400

    Added some handy tools for general w2v embeddings analysis.

    (cherry picked from NRC-private branch commit 7bd1fca5447cfc176da26a48fdebb90f9d602da7)

commit 2d81344c61d7e1dbeac55ab5943715c293c0ddd1
Author: Jackie Lo
Date:   Tue May 7 12:38:12 2019 -0400

    Redesign of data structure for sentences to support additional subword unit view and corresponding subword unit embeddings.

    (cherry picked from NRC-private branch commit 1df49fec9be3196adfbd5ab990dd7a323ec9c7f6)
---
 src/Makefile                       |    3 +-
 src/emap_test.cpp                  |   74 ++
 src/lexsim.cpp                     |   93 +-
 src/lexsim.h                       |   26 +-
 src/ngram_test.cpp                 |   43 +
 src/oov_test.cpp                   |   39 +
 src/overlapvocab_test.cpp          |   49 ++
 src/phrasesim.h                    |  286 ++++--
 src/sent.cpp                       |  248 ++++++
 src/sent.h                         |   65 ++
 src/srl.cpp                        |    4 +-
 src/srl.h                          |    4 +-
 src/srl_test.cpp                   |   33 +-
 src/srlgraph.cpp                   |   80 +-
 src/srlgraph.h                     |   23 +-
 src/srlgraph_test.cpp              |   19 +-
 src/srlmate.cpp                    |   45 +-
 src/srlmate.h                      |    6 +-
 src/srlmate_test.cpp               |   18 +-
 src/srlutil.cpp                    |  212 +++--
 src/srlutil.h                      |   19 +-
 src/srlutil_test.cpp               |    6 +-
 src/util.cpp                       |   28 +-
 src/util.h                         |   12 +-
 src/yisi.cpp                       |  178 +++-
 src/yisigraph.cpp                  |   33 +-
 src/yisigraph.h                    |  161 ++--
 src/yisiscorer.h                   | 1298 ++++++++++++++--------------
 src/yisiscorer_test.cpp            |   12 +-
 test/ref/srlgraph_test.out         |    2 +-
 test/ref/srlutil_test.out          |   20 +-
 test/ref/test_hyp.docyisi0         |    2 +-
 test/ref/test_hyp.docyisi1_srl     |    2 +-
 test/ref/test_hyp.docyisi1_srl.alt |    2 +-
 test/ref/test_hyp.docyisi2_srl     |    2 +-
 test/ref/test_hyp.docyisi2_srl.alt |    2 +-
 test/ref/test_hyp.sntyisi0         |   20 +-
 test/ref/test_hyp.sntyisi1_srl     |   20 +-
 test/ref/test_hyp.sntyisi1_srl.alt |   20 +-
 test/ref/test_hyp.sntyisi2_srl     |   16 +-
 test/ref/test_hyp.sntyisi2_srl.alt |   16 +-
 test/ref/test_ref.en.srl           |   20 +-
 test/ref/test_ref.en.srl.alt       |   22 +-
 test/ref/test_yisi_0.out           |    6 +-
 test/ref/test_yisi_1.out           |    6 +-
 test/ref/test_yisi_1_srl.out       |    6 +-
 test/ref/test_yisi_2.out           |    6 +-
 test/ref/test_yisi_2_srl.out       |    6 +-
 48 files changed, 2112 insertions(+), 1201 deletions(-)
 create mode 100644 src/emap_test.cpp
 create mode 100644 src/ngram_test.cpp
 create mode 100644
src/oov_test.cpp create mode 100644 src/overlapvocab_test.cpp create mode 100644 src/sent.cpp create mode 100644 src/sent.h diff --git a/src/Makefile b/src/Makefile index 4c2aff4..030ef51 100644 --- a/src/Makefile +++ b/src/Makefile @@ -12,7 +12,7 @@ # Override the value of MATEPLUS_HOME with a command line definition, or # consider defining MATEPLUS_HOME in your .profile, for example: # export MATEPLUS_HOME=~/u/sandboxes/mateplus -MATEPLUS_HOME ?= ~/u/tools/MATE/mateplus-master/src +MATEPLUS_HOME ?= /home/loc982/u/tools/mateplus MATETOOLS_HOME ?= $(MATEPLUS_HOME) @@ -47,6 +47,7 @@ LIBRARIES += -Wl,-Bstatic -lcmdlp -Wl,-Bdynamic PROG_NAMES := yisi TEST_NAMES := srlgraph_test maxmatching_test lexsim_test w2v_test biw2v_test \ lexweight_test phrasesim_test srl_test srlutil_test util_test \ + emap_test oov_test ngram_test overlapvocab_test \ yisiscorer_test testbin CMDLP_TEST_NAMES := cmdlp_test diff --git a/src/emap_test.cpp b/src/emap_test.cpp new file mode 100644 index 0000000..300b833 --- /dev/null +++ b/src/emap_test.cpp @@ -0,0 +1,74 @@ +/** + * @file lexsim_test.cpp + * @brief Unit test for lexsim. 
+ * + * @author Jackie Lo + * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#include "lexsim.h" + +#include +#include + +using namespace std; +using namespace yisi; + +int main(int argc, char* argv[]) +{ + string inpembpath = argv[1]; + string hypembpath = argv[2]; + string inpmappath = argv[3]; + string inpdocpath = argv[4]; + ofstream INPMAP; + ofstream INPDOC; + open_ofstream(INPMAP, inpmappath); + + map > inpemb; + map > hypemb; + map > inpfilemb; + int dim; + read_binw2v(inpembpath, inpemb, dim); + read_binw2v(hypembpath, hypemb, dim); + vector inpsents = read_file(inpdocpath); + //filter the inp emb according to the inp doc + set tokens; + for (auto it = inpsents.begin(); it != inpsents.end(); it++) { + auto sent = tokenize(*it); + tokens.insert(sent.begin(), sent.end()); + } + for (auto it = tokens.begin(); it != tokens.end(); it++) { + auto jt = inpemb.find(*it); + if (jt != inpemb.end()) { + inpfilemb[*it] = jt->second; + } + } + + string maxsim_str; + double maxsim_scr=0.0; + for (auto it = inpfilemb.begin(); it != inpfilemb.end(); it++) { + auto inp_s = it->first; + auto inp_v = it->second; + for (auto jt = hypemb.begin(); jt != hypemb.end(); jt++) { + auto hyp_s = jt->first; + auto hyp_v = jt->second; + double sim = 0.0; + for (int i = 0; i < dim; i++) { + sim += inp_v[i] * hyp_v[i]; + } + if (sim > maxsim_scr) { + maxsim_str = hyp_s; + maxsim_scr = sim; + } + } + INPMAP << inp_s << "\t" << maxsim_str << endl; + maxsim_scr = 0.0; + } + return 0; +} + diff --git a/src/lexsim.cpp b/src/lexsim.cpp index b8f69b0..f04e79e 100644 --- a/src/lexsim.cpp +++ b/src/lexsim.cpp @@ -45,15 +45,15 @@ double lexsimexact_t::get_sim(string ref, string hyp, int mode) { 
double lexsimlcs_t::get_sim(string ref, string hyp, int mode) { if (mode == yisi::INP_MODE) { cerr << "ERROR: longest common subsequence lex sim model is not defined " - << "in crosslingual settings. Exiting..." << endl; + << "in crosslingual settings. Exiting..." << endl; exit(1); } double lcs_n = 0.0; size_t ref_n = ref.length(); size_t hyp_n = hyp.length(); // find the length of the longest common character subsequence - for (size_t i = 0; i < ref_n - lcs_n - 1; i++) { - //cerr << "Current ref pos: " << i << endl; + for (size_t i = 0; i < ref_n - lcs_n; i++) { + // cerr << "Current ref pos: " << i << endl; size_t j; for (j = lcs_n + 1; j <= ref_n - i; j++) { //cerr << "Previous common length: " << lcs_n << endl; @@ -214,51 +214,51 @@ void lexsimw2v_t::write_txtw2v(std::string path) { cerr << "Done." << endl; } -lexsimemapw2v_t::lexsimemapw2v_t(string emap_path, string outw2v_path) -: lexsimw2v_t(outw2v_path) { - cerr << "Reading emap model from " << emap_path << endl; - ifstream EMAP(emap_path.c_str()); - if (!EMAP) { - cerr << "ERROR: fail to open ibm model. Exiting..." << endl; - exit(1); - } - while (!EMAP.eof()) { - string inp; - string hyp; - EMAP >> inp >> hyp; - emap_m[inp]=hyp; - } - EMAP.close(); - cerr << "Finished reading." << endl; +lexsimemapw2v_t::lexsimemapw2v_t(string emap_path, string outw2v_path) : + lexsimw2v_t(outw2v_path) { + cerr << "Reading emap model from " << emap_path << endl; + ifstream EMAP(emap_path.c_str()); + if (!EMAP) { + cerr << "ERROR: fail to open emap model. Exiting..." << endl; + exit(1); + } + while (!EMAP.eof()) { + string inp; + string hyp; + EMAP >> inp >> hyp; + emap_m[inp] = hyp; + } + EMAP.close(); + cerr << "Finished reading." 
<< endl; } vector& lexsimemapw2v_t::get_wv(string word, int mode) { - if (mode == yisi::INP_MODE){ - if (emap_m.find(word) != emap_m.end()) { - word = emap_m[word]; - } else if (emap_m.find(lowercase(word)) != emap_m.end()){ - word = emap_m[lowercase(word)]; - } - } - return yisi::get_wv(outembeddings_m, word); + if (mode == yisi::INP_MODE) { + if (emap_m.find(word) != emap_m.end()) { + word = emap_m[word]; + } else if (emap_m.find(lowercase(word)) != emap_m.end()) { + word = emap_m[lowercase(word)]; + } + } + return yisi::get_wv(outembeddings_m, word); } double lexsimemapw2v_t::get_sim(string s1, string hyp, int mode) { - if (lowercase(s1) == lowercase(hyp)){ - return 1.0; - } else { - double result = this->get_sim(this->get_wv(s1, mode), this->get_wv(hyp, yisi::HYP_MODE)); - //cerr << "(" << s1 << "," << hyp << "," << mode << "," << result << ")" << endl; - return result; - } + if (lowercase(s1) == lowercase(hyp)) { + return 1.0; + } else { + double result = this->get_sim(this->get_wv(s1, mode), this->get_wv(hyp, yisi::HYP_MODE)); + //cerr << "(" << s1 << "," << hyp << "," << mode << "," << result << ")" << endl; + return result; + } } double lexsimemapw2v_t::get_sim(vector& s1, vector& hyp) { - if ((int)s1.size() == dimension_m && (int)hyp.size() == dimension_m) { - return yisi::get_sim(s1, hyp, func_m); - } else { - return 0.0; - } + if ((int)s1.size() == dimension_m && (int)hyp.size() == dimension_m) { + return yisi::get_sim(s1, hyp, func_m); + } else { + return 0.0; + } } lexsimbiw2v_t::lexsimbiw2v_t(string inpw2v_path, string outw2v_path) @@ -307,6 +307,15 @@ double lexsimbiw2v_t::get_sim(vector& s1, vector& hyp) { } } +double lexsimemb_t::get_sim(string s1, string hyp, int mode){ + cerr <<"ERROR: lexsim model is a contextual embedding model, cannot compute lexsim without providing the embedding. Exiting..." 
<< endl; + exit(1); +} + +double lexsimemb_t::get_sim(vector& s1, vector& hyp){ + return yisi::get_sim(s1, hyp, func_m); +} + lexsim_t::lexsim_t() { lexsim_p = new lexsimexact_t(); } @@ -324,6 +333,8 @@ lexsim_t::lexsim_t(string name, string out_path, string inp_path) { lexsim_p = new lexsimbiw2v_t(inp_path, out_path); } else if (name == "lcs") { lexsim_p = new lexsimlcs_t(); + } else if (name == "emb"){ + lexsim_p = new lexsimemb_t(); } else { cerr << "ERROR: Unknown lexsim model type " << name << endl; } @@ -345,6 +356,8 @@ lexsim_t::lexsim_t(lexsim_t& rhs) { lexsim_p = new lexsimbiw2v_t(rhs.inplexsim_path_m, rhs.outlexsim_path_m); } else if (rhs.lexsim_name_m == "lcs") { lexsim_p = new lexsimlcs_t(); + } else if (rhs.lexsim_name_m == "emb") { + lexsim_p = new lexsimemb_t(); } lexsim_name_m = rhs.lexsim_name_m; outlexsim_path_m = rhs.outlexsim_path_m; @@ -404,7 +417,7 @@ void yisi::read_binw2v(string path, map >& model, int& di long long d = 0; char tmp; - cerr << "Reading w2v model from " << path << endl; + cerr << "Reading w2v binary model from " << path << endl; ifstream W2V(path.c_str(), ios::in | ios::binary); if (!W2V) { cerr << "ERROR: Failed to open w2v model. Exiting..." 
<< endl; diff --git a/src/lexsim.h b/src/lexsim.h index f4ddb98..6362474 100644 --- a/src/lexsim.h +++ b/src/lexsim.h @@ -97,16 +97,28 @@ namespace yisi { int dimension_m; }; // class lexsimw2v_t + class lexsimemb_t:public lexsimmodel_t { + public: + lexsimemb_t() { + func_m = "cosine"; + } + virtual ~lexsimemb_t() {} + virtual double get_sim(std::string ref, std::string hyp, int mode); + virtual double get_sim(std::vector& ref, std::vector& hyp); + protected: + std::string func_m; + }; // class lexsimw2v_t + class lexsimemapw2v_t:public lexsimw2v_t { public: - lexsimemapw2v_t() {} - lexsimemapw2v_t(std::string emap_path, std::string outw2v_func); - virtual ~lexsimemapw2v_t() {} - std::vector& get_wv(std::string word, int mode); - virtual double get_sim(std::string s1, std::string hyp, int mode); - virtual double get_sim(std::vector& s1, std::vector& hyp); + lexsimemapw2v_t() {} + lexsimemapw2v_t(std::string emap_path, std::string outw2v_func); + virtual ~lexsimemapw2v_t() {} + std::vector& get_wv(std::string word, int mode); + virtual double get_sim(std::string s1, std::string hyp, int mode); + virtual double get_sim(std::vector& s1, std::vector& hyp); private: - std::map emap_m; + std::map emap_m; }; // class lexsimemapw2v_t // class lexsimibm_t:public lexsimmodel_t { diff --git a/src/ngram_test.cpp b/src/ngram_test.cpp new file mode 100644 index 0000000..7a4b901 --- /dev/null +++ b/src/ngram_test.cpp @@ -0,0 +1,43 @@ +/** + * @file w2v_test.cpp + * @brief Unit test for w2v lexsim. 
+ * + * @author Jackie Lo + * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#include "util.h" + +#include +#include +#include +#include + +using namespace std; +using namespace yisi; + +int main(int argc, char* argv[]) +{ + set result; + while (!cin.eof()) { + string line; + cin >> line; + auto tokens = tokenize(line); + auto ngrams = collect_ngram(atoi(argv[1]), tokens); + for (auto it = ngrams.begin(); it != ngrams.end(); it++) { + auto ngram = join(*it); + result.insert(ngram); + } + } + for (auto it = result.begin(); it != result.end(); it++) { + cout << *it << endl; + } + + return 0; +} + diff --git a/src/oov_test.cpp b/src/oov_test.cpp new file mode 100644 index 0000000..0c5c32c --- /dev/null +++ b/src/oov_test.cpp @@ -0,0 +1,39 @@ +/** + * @file w2v_test.cpp + * @brief Unit test for w2v lexsim. 
+ * + * @author Jackie Lo + * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#include "lexsim.h" + +#include + +using namespace std; +using namespace yisi; + +int main(int argc, char* argv[]) +{ + lexsim_t w2vtxt("w2v", argv[1], "cosine"); + string sent; + + while(!cin.eof()){ + getline(cin, sent); + //cout << sent << endl; + auto tokens = tokenize(sent); + for (auto it = tokens.begin(); it != tokens.end(); it++){ + if ((w2vtxt.get_wv(*it,HYP_MODE)).size() == 0){ + cout << *it << endl; + } + } + } + + return 0; +} + diff --git a/src/overlapvocab_test.cpp b/src/overlapvocab_test.cpp new file mode 100644 index 0000000..b675406 --- /dev/null +++ b/src/overlapvocab_test.cpp @@ -0,0 +1,49 @@ +/** + * @file lexsim_test.cpp + * @brief Unit test for lexsim. 
+ * + * @author Jackie Lo + * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#include "lexsim.h" + +#include + +using namespace std; +using namespace yisi; + +int main(int argc, char* argv[]) +{ + string inpembpath = argv[1]; + string hypembpath = argv[2]; + map > inpemb; + map > hypemb; + int dim; + read_binw2v(inpembpath, inpemb, dim); + read_binw2v(hypembpath, hypemb, dim); + + auto it = inpemb.begin(); + auto jt = hypemb.begin(); + while (it != inpemb.end() && jt != hypemb.end()) { + string inp = it->first; + string hyp = jt->first; + if (inp == hyp) { + cout << inp << " " << hyp << endl; + it++; + jt++; + } else if (inp.compare(hyp) < 0) { + it++; + } else { + jt++; + } + } + + return 0; +} + diff --git a/src/phrasesim.h b/src/phrasesim.h index a092c14..635a8ca 100644 --- a/src/phrasesim.h +++ b/src/phrasesim.h @@ -51,68 +51,68 @@ namespace yisi { using namespace com::masaers::cmdlp; p.add(make_knob(lexsim_name_m)) - .fallback("exact") - .desc("Type of lex sim model: [exact(default)|ibm1|w2v|ibmw2v]") - .name("lexsim-type") - ; + .fallback("exact") + .desc("Type of lex sim model: [exact(default)|ibm1|w2v|ibmw2v]") + .name("lexsim-type") + ; p.add(make_knob(outlexsim_path_m)) - .fallback("") - .desc("Path to lex sim model file in output language") - .name("outlexsim-path") - ; + .fallback("") + .desc("Path to lex sim model file in output language") + .name("outlexsim-path") + ; p.add(make_knob(inplexsim_path_m)) - .fallback("") - .desc("Path to lex sim model file in input language") - .name("inplexsim-path") - ; + .fallback("") + .desc("Path to lex sim model file in input language") + .name("inplexsim-path") + ; p.add(make_knob(inplexweight_name_m)) - 
.fallback("uniform") - .desc("Type of input lex weight model: [uniform(default)|file|learn]") - .name("inplexweight-type") - ; + .fallback("uniform") + .desc("Type of input lex weight model: [uniform(default)|file|learn]") + .name("inplexweight-type") + ; p.add(make_knob(inplexweight_path_m)) - .fallback("") - .desc("[file: path to input lex weight model file " - "| learn: monolingual corpus in input language to learn]") - .name("inplexweight-path") - ; + .fallback("") + .desc("[file: path to input lex weight model file " + "| learn: monolingual corpus in input language to learn]") + .name("inplexweight-path") + ; p.add(make_knob(reflexweight_name_m)) - .fallback("uniform") - .desc("Type of reference lex weight model: [uniform(default)|file|learn]") - .name("lexweight-type") - .name("reflexweight-type") - ; + .fallback("uniform") + .desc("Type of reference lex weight model: [uniform(default)|file|learn]") + .name("lexweight-type") + .name("reflexweight-type") + ; p.add(make_knob(reflexweight_path_m)) - .fallback("") - .desc("[file: path to reference lex weight model file " - "| learn: monolingual corpus in reference language to learn]") - .name("lexweight-path") - .name("reflexweight-path") - ; + .fallback("") + .desc("[file: path to reference lex weight model file " + "| learn: monolingual corpus in reference language to learn]") + .name("lexweight-path") + .name("reflexweight-path") + ; p.add(make_knob(hyplexweight_name_m)) - .fallback("") - .desc("Type of hypotheses lex weight model: [uniform|file|learn] " - "(default: same as reflexweight-type") - .name("hyplexweight-type") - ; + .fallback("") + .desc("Type of hypotheses lex weight model: [uniform|file|learn] " + "(default: same as reflexweight-type") + .name("hyplexweight-type") + ; p.add(make_knob(hyplexweight_path_m)) - .fallback("") - .desc("[file: path to hypotheses lex weight model file " - "| learn: monolingual corpus in hypothesis language to learn]") - .name("hyplexweight-path") - ; + .fallback("") + 
.desc("[file: path to hypotheses lex weight model file " + "| learn: monolingual corpus in hypothesis language to learn]") + .name("hyplexweight-path") + ; p.add(make_knob(phrasesim_name_m)) - .fallback("nwpr") - .desc("Type of phrase sim model: [nwpf: n-gram idf-weighted precision/recall]") - .name("psname") - .name("phrasesim-type") - ; + .fallback("nwpr") + .desc("Type of phrase sim model: [nwpf: n-gram idf-weighted precision/recall]") + .name("psname") + .name("phrasesim-type") + ; p.add(make_knob(n_m)) - .fallback(0) - .desc("N-gram size") - .name("ngram-size") - .name("n") - ; + .fallback(0) + .desc("N-gram size") + .name("ngram-size") + .name("n") + ; } }; // struct phrasesim_options @@ -245,7 +245,56 @@ namespace yisi { } else { mpscache_m[s1txt][hyptxt] = s; } - //std::cerr << "(" << s1txt << " ||| " << hyptxt << " ||| " << s.first << "," << s.second << ")" << std::endl; + return s; + }; + + std::pair operator()(std::vector s1tokens, + std::vector& hyptokens, + std::vector > s1embs, + std::vector > hypembs, int mode) { + std::pair result; + if (s1tokens.size() == 0 || hyptokens.size() == 0) { + result = std::make_pair(0.0, 0.0); + return result; + } + std::string s1txt; + size_t i; + for (i = 0; i < s1tokens.size() - 1; i++) { + s1txt = s1txt + s1tokens[i] + " "; + } + s1txt = s1txt + s1tokens[i]; + std::string hyptxt; + size_t j; + for (j = 0; j < hyptokens.size() - 1; j++) { + hyptxt = hyptxt + hyptokens[j] + " "; + } + hyptxt = hyptxt + hyptokens[j]; + + if (mode == yisi::INP_MODE) { + if (xpscache_m.find(s1txt) != xpscache_m.end()) { + if (xpscache_m[s1txt].find(hyptxt) != xpscache_m[s1txt].end()) { + return xpscache_m[s1txt][hyptxt]; + } + } else { + std::map > c; + xpscache_m[s1txt] = c; + } + } else { + if (mpscache_m.find(s1txt) != mpscache_m.end()) { + if (mpscache_m[s1txt].find(hyptxt) != mpscache_m[s1txt].end()) { + return mpscache_m[s1txt][hyptxt]; + } + } else { + std::map > c; + mpscache_m[s1txt] = c; + } + } + auto s = nwpr(s1tokens, 
hyptokens, s1embs, hypembs, mode); + if (mode == yisi::INP_MODE) { + xpscache_m[s1txt][hyptxt] = s; + } else { + mpscache_m[s1txt][hyptxt] = s; + } return s; }; @@ -255,7 +304,7 @@ namespace yisi { //std::cerr<<"ng: " << hyptokens.size()<get_sim(s1tokens[i], hyptokens[i], mode); //std::cerr << ls << std::endl; rresult += rw * ls; @@ -280,6 +329,45 @@ namespace yisi { rlen += rw; plen += pw; } + //std::cerr<<"(" << presult / plen<<","< result = std::make_pair(presult / plen, rresult / rlen); + return result; + } + + std::pair ngram(std::vector& s1tokens, + std::vector& hyptokens, + std::vector > s1embs, + std::vector > hypembs, + int mode) { + //std::cerr<<"ng: " << s1tokens.size()<get_sim(s1embs[i], hypembs[i]); + //std::cerr << ls << std::endl; + rresult += rw * ls; + presult += pw * ls; + rlen += rw; + plen += pw; + } std::pair result = std::make_pair(presult / plen, rresult / rlen); return result; } @@ -300,71 +388,93 @@ namespace yisi { std::pair nwpr(std::vector& s1tokens, std::vector& hyptokens, int mode) { - std::string s1txt = yisi::join(s1tokens); - std::string hyptxt = yisi::join(hyptokens); - //std::cerr << s1txt << std::endl; - //std::cerr< > s1ngrams; std::vector > hypngrams; if ((int)s1tokens.size() < n_m || (int)hyptokens.size() < n_m) { - s1ngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), s1tokens); - hypngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), hyptokens); + s1ngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), s1tokens); + hypngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), hyptokens); } else { - s1ngrams = yisi::collect_ngram(n_m, s1tokens); - hypngrams = yisi::collect_ngram(n_m, hyptokens); + s1ngrams = yisi::collect_ngram(n_m, s1tokens); + hypngrams = yisi::collect_ngram(n_m, hyptokens); } - //std::cerr << s1ngrams.size() << std::endl; - //std::cerr << hypngrams.size()<= 1) { - // return recall; - //} - nom = 0.0; denom = 0.0; - //std::cerr< 
0.0){ - // return (precision*recall)/(a*precision+(1-a)*recall); - //} else { - // return 0.0; - //} + std::pair result = std::make_pair(precision, recall); + return result; + } + + std::pair nwpr(std::vector& s1tokens, + std::vector& hyptokens, + std::vector > s1embs, + std::vector > hypembs, + int mode) { + std::vector > s1ngrams; + std::vector > hypngrams; + std::vector > > s1embngrams; + std::vector > > hypembngrams; + + if ((int)s1tokens.size() < n_m || (int)hyptokens.size() < n_m) { + s1ngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), s1tokens); + hypngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), hyptokens); + s1embngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), s1embs); + hypembngrams = yisi::collect_ngram(std::min(s1tokens.size(), hyptokens.size()), hypembs); + } else { + s1ngrams = yisi::collect_ngram(n_m, s1tokens); + hypngrams = yisi::collect_ngram(n_m, hyptokens); + s1embngrams = yisi::collect_ngram(n_m, s1embs); + hypembngrams = yisi::collect_ngram(n_m, hypembs); + } + double nom = 0.0; + double denom = 0.0; + + for (size_t ii = 0; ii < s1ngrams.size(); ii++) { + double sim = 0.0; + double rw = ngramlw(s1ngrams[ii], mode); + + for (size_t jj = 0; jj < hypngrams.size(); jj++) { + sim = std::fmax(sim, ngram(s1ngrams[ii], hypngrams[jj], s1embngrams[ii], hypembngrams[jj], mode).second); + } + nom += rw * sim; + denom += rw; + } + double recall = nom / denom; + nom = 0.0; + denom = 0.0; + for (size_t iii = 0; iii < hypngrams.size(); iii++) { + double hs = 0.0; + double hw = ngramlw(hypngrams[iii], yisi::HYP_MODE); + for (size_t jjj = 0; jjj < s1ngrams.size(); jjj++) { + hs = std::fmax(hs, ngram(s1ngrams[jjj], hypngrams[iii], s1embngrams[jjj], hypembngrams[iii], mode).first); + } + nom += hw * hs; + denom += hw; + } + double precision = nom / denom; std::pair result = std::make_pair(precision, recall); return result; } diff --git a/src/sent.cpp b/src/sent.cpp new file mode 100644 
index 0000000..93ddda3 --- /dev/null +++ b/src/sent.cpp @@ -0,0 +1,248 @@ +/** + * @file sent.cpp + * @brief Sentence + * + * @author Jackie Lo + * + * Class implementation for the classes: + * - sent_t + * and the definitions of some utility functions working on it. + * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#include "sent.h" + +#include +#include +#include + +using namespace yisi; +using namespace std; + +sent_t::sent_t() { + sent_type_m = "word"; +} + +sent_t::sent_t(string sent_type) { + sent_type_m = sent_type; +} + +sent_t::sent_t(const sent_t& rhs) { + sent_type_m = rhs.sent_type_m; + token_m = rhs.token_m; + unit_m = rhs.unit_m; + emb_m = rhs.emb_m; + tid2uspan_m = rhs.tid2uspan_m; + uid2tid_m = rhs.uid2tid_m; +} + +void sent_t::operator=(const sent_t& rhs) { + sent_type_m = rhs.sent_type_m; + token_m = rhs.token_m; + unit_m = rhs.unit_m; + emb_m = rhs.emb_m; + tid2uspan_m = rhs.tid2uspan_m; + uid2tid_m = rhs.uid2tid_m; +} + +string sent_t::get_type() { + return sent_type_m; +} + +vector sent_t::get_tokens(span_type tspan) { + vector result; + for (size_t i = tspan.first; i < tspan.second; i++) { + result.push_back(token_m[i]); + } + /* + cerr << "In get_tokens(" << tspan.first << "," << tspan.second << "): "; + for (auto it = result.begin(); it != result.end(); it++) { + cerr << *it << " "; + } + cerr << endl; + */ + return result; +} + +vector sent_t::get_tokens() { + // cerr << "In get_tokens(): " << endl; + return token_m; +} + +vector sent_t::get_units(span_type uspan) { + vector result; + if (sent_type_m == "word") { + for (size_t i = uspan.first; i < uspan.second; i++) { + result.push_back(token_m[i]); + } + } else { + for (size_t i = 
uspan.first; i < uspan.second; i++) { + if (i < unit_m.size()) { + result.push_back(unit_m[i]); + } + } + } + return result; +} + +vector > sent_t::get_embs(span_type uspan) { + if (sent_type_m == "uemb") { + vector > result; + for (size_t i = uspan.first; i < uspan.second; i++) { + result.push_back(emb_m[i]); + } + return result; + } else { + cerr << "ERROR: sentence type (" << sent_type_m << ") " + << "does not provide contextual embeddings. Exiting..." << endl; + exit(1); + } +} + +sent_t::span_type sent_t::tspan2uspan(span_type tspan) { + if (sent_type_m == "word") { + return tspan; + } else { + //cerr << tid2uspan_m.size(); + if (tspan.first < tid2uspan_m.size() && (tspan.second-1) < tid2uspan_m.size()) { + return span_type(tid2uspan_m[tspan.first].first, tid2uspan_m[tspan.second-1].second); + } else { + return tspan; + } + } +} + +sent_t::span_type sent_t::uspan2tspan(span_type uspan) { + if (sent_type_m == "word") { + return uspan; + } else { + return span_type(uid2tid_m[uspan.first], uid2tid_m[uspan.second-1]); + } +} + +void sent_t::set_tokens(vector t) { + token_m = t; + /* + cerr << "In set_tokens(t): "; + for (auto it = token_m.begin(); it != token_m.end(); it++) { + cerr << *it << " "; + } + cerr << endl; + */ +} + +void sent_t::set_units(vector u ) { + unit_m = u; +} + +void sent_t::set_embs(vector > e) { + emb_m = e; +} + +void sent_t::set_tid2uspan(vector t2u) { + tid2uspan_m = t2u; +} + +void sent_t::set_uid2tid(vector u2t) { + uid2tid_m = u2t; +} + +size_t sent_t::get_token_size() { + return token_m.size(); +} + +vector yisi::read_sent(string sent_type, string token_path, string unit_path, string idemb_path) { + vector result; + vector > emb; + vector t2u; + vector u2t; + size_t currtid = (size_t)-1; + + //cerr << token_path << " "; + auto token_strs = read_file(token_path); + if (unit_path == "") { + for (auto tt = token_strs.begin(); tt != token_strs.end(); tt++) { + sent_t* sent_p = new sent_t(sent_type); + sent_p->set_tokens(tokenize(*tt)); + 
//cerr << "sentlength=" << sent.get_token_size(); + result.push_back(sent_p); + //cerr << " Done." << endl; + } + + } else { + // cerr << unit_path << " "; + auto unit_strs = read_file(unit_path); + auto tt = token_strs.begin(); + auto ut = unit_strs.begin(); + //cerr << idemb_path << " "; + ifstream fin(idemb_path.c_str()); + if (!fin) { + cerr << "ERROR: Failed to open idemb file (" << idemb_path << "). Exiting..." << endl; + exit(1); + } + while (!fin.eof()) { + string line; + getline(fin, line); + if (line.empty()) { + sent_t* s = new sent_t(sent_type); + auto tokens = tokenize(*tt); + s->set_tokens(tokens); + //cerr << "#token=" << tokens.size(); + auto units = tokenize(*ut); + s->set_units(units); + //cerr << " #unit=" << units.size(); + if (sent_type == "uemb") { + s->set_embs(emb); + } + //cerr << " #emb=" << emb.size() << " #dim=" << emb[0].size(); + s->set_tid2uspan(t2u); + //cerr << " #tid=" << t2u.size(); + s->set_uid2tid(u2t); + //cerr << " #uid=" << u2t.size() << " "; + + result.push_back(s); + tt++; + ut++; + emb.clear(); + t2u.clear(); + u2t.clear(); + currtid = (size_t)-1; + } else { + istringstream iss(line); + size_t uid; + size_t tid; + iss >> uid >> tid; + u2t.push_back(tid); + if (tid != currtid) { + t2u.push_back(sent_t::span_type(uid,uid+1)); + currtid=tid; + } else { + t2u.back().second=uid+1; + } + if (sent_type == "uemb") { + vector e; + double len = 0.0; + double v; + while (!iss.eof()) { + iss >> v; + e.push_back(v); + len += v*v; + } + len = sqrt(len); + for (size_t i = 0; i < e.size(); i++) { + e[i] /= len; + } + emb.push_back(e); + } + } + fin.peek(); + } + fin.close(); + } + return result; +} diff --git a/src/sent.h b/src/sent.h new file mode 100644 index 0000000..c688410 --- /dev/null +++ b/src/sent.h @@ -0,0 +1,65 @@ +/** + * @file sent.h + * @brief Sentence + * + * @author Jackie Lo + * + * Class definition of sentence classes: + * - sent_t + * and the declaration of some utility functions working on it. 
+ * + * Multilingual Text Processing / Traitement multilingue de textes + * Digital Technologies Research Centre / Centre de recherche en technologies numériques + * National Research Council Canada / Conseil national de recherches Canada + * Copyright 2019, Her Majesty in Right of Canada / + * Copyright 2019, Sa Majeste la Reine du Chef du Canada + */ + +#ifndef SENT_H +#define SENT_H + +#include "util.h" + +#include +#include +#include +#include +#include + +namespace yisi { + + class sent_t { + public: + typedef std::pair span_type; + sent_t(); + sent_t(std::string sent_type); + sent_t(const sent_t& rhs); + void operator=(const sent_t& rhs); + ~sent_t() {}; + std::string get_type(); + std::vector get_tokens(span_type tspan); + std::vector get_tokens(); + std::vector get_units(span_type uspan); + std::vector > get_embs(span_type uspan); + void set_tokens(std::vector t); + void set_units(std::vector u); + void set_embs(std::vector > e); + void set_tid2uspan(std::vector t2u); + void set_uid2tid(std::vector u2t); + span_type tspan2uspan(span_type tspan); + span_type uspan2tspan(span_type uspan); + size_t get_token_size(); + private: + std::string sent_type_m; + std::vector token_m; + std::vector unit_m; + std::vector > emb_m; + std::vector tid2uspan_m; + std::vector uid2tid_m; + }; // class sent_t + + std::vector read_sent(std::string sent_type, std::string token_path, std::string unit_path="", std::string idemb_path=""); + +} // yisi + +#endif diff --git a/src/srl.cpp b/src/srl.cpp index a456687..59c54ef 100644 --- a/src/srl.cpp +++ b/src/srl.cpp @@ -51,10 +51,10 @@ srl_t::~srl_t() { } } -srlgraph_t srl_t::parse(string sent) { +srlgraph_t srl_t::parse(sent_t* sent) { return srl_p->parse(sent); } -vector srl_t::parse(vector sents) { +vector srl_t::parse(vector sents) { return srl_p->parse(sents); } diff --git a/src/srl.h b/src/srl.h index 079b0c0..fff746a 100644 --- a/src/srl.h +++ b/src/srl.h @@ -30,8 +30,8 @@ namespace yisi { srl_t(); srl_t(const std::string name, 
const std::string path=""); ~srl_t(); - srlgraph_t parse(std::string sent); - std::vector parse(std::vector sents); + srlgraph_t parse(sent_t* sent); + std::vector parse(std::vector sents); private: srlmodel_t* srl_p; }; // class srl_t diff --git a/src/srl_test.cpp b/src/srl_test.cpp index 432ec4d..997e571 100644 --- a/src/srl_test.cpp +++ b/src/srl_test.cpp @@ -27,21 +27,7 @@ int main(const int argc, const char* argv[]) if (argc == 1) { srl_t mate("mate", "parse_full_es.sh"); - vector sents; - - ifstream IN("test_es.txt"); - if (IN.fail() or IN.bad()) { - cerr << "ERROR: Failed to open: test_es.txt. Exiting..." << endl; - exit(1); - } - - while (!IN.eof()) { - string line; - getline(IN, line); - if (line != "") { - sents.push_back(line); - } - } + vector sents = read_sent("word", "test_es.txt"); auto r = mate.parse(sents); cout << "Done parsing " << r.size() << " srlgraphs." << endl; @@ -52,22 +38,7 @@ int main(const int argc, const char* argv[]) } else { srl_t parser(argv[1], argv[2]); - vector sents; - - ifstream IN(argv[3]); - if (IN.fail() or IN.bad()) { - cerr << "ERROR: Failed to open:" << argv[3] << ". Exiting..." << endl; - exit(1); - } - - while (!IN.eof()) { - string line; - getline(IN, line); - if (line != "") { - sents.push_back(line); - } - } - IN.close(); + vector sents = read_sent("word", string(argv[3])); auto r = parser.parse(sents); cout << "Done parsing " << r.size() << " srlgraphs." 
<< endl; diff --git a/src/srlgraph.cpp b/src/srlgraph.cpp index 2ff72fb..b805d86 100644 --- a/src/srlgraph.cpp +++ b/src/srlgraph.cpp @@ -27,22 +27,22 @@ using namespace std; srlgraph_t::srlgraph_t() { } -srlgraph_t::srlgraph_t(vector& tokens) { - span_type r(0, tokens.size()); +srlgraph_t::srlgraph_t(sent_t* sent) { + span_type r(0, sent->get_token_size()); root_m = srl_m.new_node(r); - tokens_m = tokens; + sent_p = sent; } srlgraph_t::srlgraph_t(const srlgraph_t& rhs) { srl_m = rhs.srl_m; - tokens_m = rhs.tokens_m; + sent_p = rhs.sent_p; root_m = rhs.root_m; predof_m = predof_m; } void srlgraph_t::operator=(const srlgraph_t& rhs) { srl_m = rhs.srl_m; - tokens_m = rhs.tokens_m; + sent_p = rhs.sent_p; root_m = rhs.root_m; predof_m = predof_m; } @@ -53,10 +53,10 @@ srlgraph_t::srlnid_type srlgraph_t::new_root() { return root_m; } -srlgraph_t::srlnid_type srlgraph_t::new_root(vector& tokens) { - span_type span(0, tokens.size()); +srlgraph_t::srlnid_type srlgraph_t::new_root(sent_t* sent) { + span_type span(0, sent->get_token_size()); root_m = srl_m.new_node(span); - tokens_m = tokens; + sent_p = sent; return root_m; } @@ -112,22 +112,30 @@ srlgraph_t::srlnid_type srlgraph_t::get_pred(srlnid_type argid) { return predof_m[argid]; } -vector& srlgraph_t::get_sentence() { - return tokens_m; + +vector srlgraph_t::get_sentence() { + return sent_p->get_tokens(); } -vector srlgraph_t::get_role_fillers(srlnid_type roleid) { - vector fillers; +vector srlgraph_t::get_role_filler_units(srlnid_type roleid) { + //vector fillers; span_type span = srl_m.get_node_data(roleid); + //cerr<get_units(sent_p->tspan2uspan(span)); + //return fillers; +} - return fillers; +vector > srlgraph_t::get_role_filler_embs(srlnid_type roleid) { + span_type span = srl_m.get_node_data(roleid); + return sent_p->get_embs(sent_p->tspan2uspan(span)); } srlgraph_t::label_type srlgraph_t::get_role_label(srlnid_type roleid) { @@ -138,10 +146,23 @@ srlgraph_t::span_type srlgraph_t::get_role_span(srlnid_type
roleid) { return srl_m.get_node_data(roleid); } +size_t srlgraph_t::get_sent_length() { + return sent_p->get_token_size(); +} + + void srlgraph_t::set_tokens(vector& tokens) { + //cerr<<"Setting new tokens..."; span_type r(0, tokens.size()); srl_m.set_node_data(root_m, r); - tokens_m = tokens; + sent_p->set_tokens(tokens); + //cerr << "Done"<get_token_size()); + srl_m.set_node_data(root_m, r); + sent_p = sent; } void srlgraph_t::set_role_span(srlnid_type roleid, span_type& span) { @@ -152,14 +173,17 @@ void srlgraph_t::set_role_label(srlnid_type roleid, label_type& label) { srl_m.set_edge_label(srl_m.get_outgoing_edges(roleid).at(0), label); } - +void srlgraph_t::delete_sent() { + //DO NOT USE UNLESS read_conll09batch(parsefile) is called to create list of srlgraphs + delete sent_p; + sent_p = NULL; +} ostream& srlgraph_t::operator<<(ostream& os) { vector preds = get_preds(); if (preds.size() > 0) { - for (vector::iterator it = preds.begin(); it != preds.end(); - it++) { - vector frame_tokens = tokens_m; + for (vector::iterator it = preds.begin(); it != preds.end(); it++) { + vector frame_tokens = sent_p->get_tokens(get_role_span(root_m)); span_type pred_span = get_role_span(*it); if (pred_span.first != pred_span.second) { frame_tokens[pred_span.first] = "[" + get_role_label(*it) + " " @@ -179,7 +203,8 @@ ostream& srlgraph_t::operator<<(ostream& os) { } } } else { - for (vector::iterator it = tokens_m.begin(); it != tokens_m.end(); it++) { + auto t = sent_p->get_tokens(); + for (auto it = t.begin(); it != t.end(); it++) { os << *it << " "; } os << endl; @@ -190,9 +215,8 @@ ostream& srlgraph_t::operator<<(ostream& os) { void srlgraph_t::print(ostream& os, int i) { vector preds = get_preds(); if (preds.size() > 0) { - for (vector::iterator it = preds.begin(); it != preds.end(); - it++) { - vector frame_tokens = tokens_m; + for (vector::iterator it = preds.begin(); it != preds.end(); it++) { + vector frame_tokens = sent_p->get_tokens(get_role_span(root_m)); span_type 
pred_span = get_role_span(*it); if (pred_span.first != pred_span.second) { frame_tokens[pred_span.first] = "[" + get_role_label(*it) + " " diff --git a/src/srlgraph.h b/src/srlgraph.h index dc7ca46..dffc71c 100644 --- a/src/srlgraph.h +++ b/src/srlgraph.h @@ -19,7 +19,7 @@ #define SRLGRAPH_H #include "graph.h" - +#include "sent.h" #include #include #include @@ -30,7 +30,7 @@ namespace yisi { class srlgraph_t { public: - typedef std::pair span_type; + typedef sent_t::span_type span_type; typedef std::string label_type; typedef graph_t::node_type srlnode_type; typedef graph_t::edge_type srledge_type; @@ -39,12 +39,12 @@ namespace yisi { srlgraph_t(); - srlgraph_t(std::vector& tokens); + srlgraph_t(sent_t* sent); srlgraph_t(const srlgraph_t& rhs); void operator=(const srlgraph_t& rhs); srlnid_type new_root(); - srlnid_type new_root(std::vector& tokens); + srlnid_type new_root(sent_t* sent); srlnid_type new_pred(); srlnid_type new_pred(span_type& span, label_type& label); srlnid_type new_arg(srlnid_type predid); @@ -56,26 +56,33 @@ namespace yisi { srlnid_type get_pred(srlnid_type argid); - std::vector& get_sentence(); - std::vector get_role_fillers(srlnid_type roleid); + std::vector get_sentence(); + std::vector get_role_filler_units(srlnid_type roleid); + std::vector > get_role_filler_embs(srlnid_type roleid); label_type get_role_label(srlnid_type roleid); span_type get_role_span(srlnid_type roleid); + std::string get_sent_type(){return sent_p->get_type();}; + size_t get_sent_length(); void set_tokens(std::vector& tokens); + void set_sent(sent_t* sent); void set_role_span(srlnid_type predid, span_type& span); void set_role_label(srlnid_type predid, label_type& label); std::ostream& operator<<(std::ostream& os); void print(std::ostream& os, int i); + void delete_sent(); + private: graph_t srl_m; - std::vector tokens_m; + sent_t* sent_p; + // std::vector tokens_m; srlnid_type root_m; std::map predof_m; }; // class srlgraph_t - + std::ostream& operator<<(std::ostream& 
os, srlgraph_t& srl); } // yisi diff --git a/src/srlgraph_test.cpp b/src/srlgraph_test.cpp index 0cfbd3c..612df54 100644 --- a/src/srlgraph_test.cpp +++ b/src/srlgraph_test.cpp @@ -22,26 +22,21 @@ using namespace std; using namespace yisi; -int main(int argc, char* argv[]) -{ - ifstream txtstr(argv[1], ifstream::in); +int main(int argc, char* argv[]) { - vector sents; - string line; - - while (getline(txtstr, line)) { - sents.push_back(line); - } + vector sents = read_sent("word", string(argv[1])); cout << "Reading ASSERT format parse file." << endl; vector srls = read_srl(sents, string(argv[2])); cout << "Printing srl parses:" << endl; - for (vector::iterator it = srls.begin(); it != srls.end(); - it++) { + for (auto it = srls.begin(); it != srls.end(); it++) { cout << (*it); } + for (auto it = sents.begin(); it != sents.end(); it++) { + delete *it; + *it = NULL; + } return 0; } - diff --git a/src/srlmate.cpp b/src/srlmate.cpp index 8bc9bf7..aa5dc1e 100644 --- a/src/srlmate.cpp +++ b/src/srlmate.cpp @@ -73,43 +73,33 @@ srlmate_t::srlmate_t(string path) { getline(iss, cfgv); if (cfgn == "yisi_home") { yisi_home = cfgv; - } - else if (cfgn == "mate_jars") { + } else if (cfgn == "mate_jars") { mate_jars = cfgv; - } - else if (cfgn == "lang") { + } else if (cfgn == "lang") { lang = cfgv; - } - else if (cfgn == "rerank") { + } else if (cfgn == "rerank") { if ((cfgv.compare("0") == 0) || (cfgv.compare("false") == 0)) { rerank = false; } else { rerank = true; } - } - else if (cfgn == "hybrid") { + } else if (cfgn == "hybrid") { if ((cfgv.compare("0") == 0) || (cfgv.compare("false") == 0)) { hybrid = false; } else { hybrid = true; } - } - else if (cfgn == "token") { + } else if (cfgn == "token") { token = cfgv; - } - else if (cfgn == "morph") { + } else if (cfgn == "morph") { morph = cfgv; - } - else if (cfgn == "lemma") { + } else if (cfgn == "lemma") { lemma = cfgv; - } - else if (cfgn == "tagger") { + } else if (cfgn == "tagger") { tagger = cfgv; - } - else if (cfgn == 
"parser") { + } else if (cfgn == "parser") { parser = cfgv; - } - else if (cfgn == "srl") { + } else if (cfgn == "srl") { srl = cfgv; } } @@ -187,18 +177,19 @@ string srlmate_t::noparse(vector tokens) { return yisi::strip(result); } -string srlmate_t::jrun(string sent) { +string srlmate_t::jrun(sent_t* sent) { string result = ""; - vector tokens = yisi::tokenize(sent); + vector tokens = sent->get_tokens(); + string sent_str = join(tokens); - if (!sent.empty() && tokens.size() <= 100) { + if (0 < tokens.size() && tokens.size() <= 100) { JNI_SAFE_CALL(methid, jen_m, GetMethodID(mate_class_m, "parse", "(Ljava/lang/String;)Ljava/lang/String;")); try { JNI_SAFE_CALL(jparse, jen_m, CallObjectMethod(mate_object_m, methid, - jen_m->NewStringUTF(sent.c_str()))); + jen_m->NewStringUTF(sent_str.c_str()))); result = jen_m->GetStringUTFChars((jstring)jparse, NULL); } catch (...) { result += noparse(tokens); @@ -209,13 +200,13 @@ string srlmate_t::jrun(string sent) { return result; } -srlgraph_t srlmate_t::parse(string sent) { +srlgraph_t srlmate_t::parse(sent_t* sent) { string srl_str = jrun(sent); - srlgraph_t result = read_conll09(srl_str); + srlgraph_t result = read_conll09(srl_str, sent); return result; } -vector srlmate_t::parse(vector sents) { +vector srlmate_t::parse(vector sents) { //batch srl-ing vector result; for (auto it = sents.begin(); it != sents.end(); it++) { diff --git a/src/srlmate.h b/src/srlmate.h index 0201287..a45bcf2 100644 --- a/src/srlmate.h +++ b/src/srlmate.h @@ -35,9 +35,9 @@ namespace yisi { srlmate_t() {} srlmate_t(std::string path); ~srlmate_t(); - std::string jrun(std::string sent); - srlgraph_t parse(std::string sent); - virtual std::vector parse(std::vector sents); + std::string jrun(sent_t* sent); + srlgraph_t parse(sent_t* sent); + virtual std::vector parse(std::vector sents); private: std::string noparse(std::vector tokens); static JavaVM* jvm_m; diff --git a/src/srlmate_test.cpp b/src/srlmate_test.cpp index cc4c2a6..51b7752 100644 --- 
a/src/srlmate_test.cpp +++ b/src/srlmate_test.cpp @@ -25,10 +25,22 @@ int main(const int argc, const char* argv[]) string sent; while (getline(cin, sent)) { - string mateout = mate.jrun(sent); + sent_t* s = new sent_t("word"); + auto tokens = tokenize(sent); + s->set_tokens(tokens); + /* + auto t = s->get_tokens(); + for (auto it = t.begin(); it != t.end(); it++) { + cerr <<*it <<" "; + } + cerr< +#include #include #include using namespace yisi; using namespace std; -vector yisi::read_srl(vector sents, string parsefile) { +vector yisi::read_srl(vector sents, string parsefile) { // read srl in ASSERT format vector result; typedef srlgraph_t::span_type span_type; typedef srlgraph_t::srlnid_type srlnid_type; - for (vector::iterator it = sents.begin(); it != sents.end(); it++) { - vector tokens = tokenize(*it); - srlgraph_t s(tokens); + for (auto it = sents.begin(); it != sents.end(); it++) { + //vector tokens = tokenize(*it); + srlgraph_t s(*it); result.push_back(s); } @@ -102,8 +103,9 @@ vector yisi::read_srl(vector sents, string parsefile) { } } // while (!iss.eof()) - if ((int)tmptok.size() > 0) { - result.at(id).set_tokens(tmptok); + if (tmptok.size() > result.at(id).get_sent_length()) { + //result.at(id).set_tokens(tmptok); + cerr << "ERROR: Tokenization of words changed by srl. Potential index failure!" 
<< endl; } } // while (!ifs.eof()) ifs.close(); @@ -112,91 +114,126 @@ vector yisi::read_srl(vector sents, string parsefile) { return result; } // read_srl -srlgraph_t yisi::read_conll09(string parse) { +srlgraph_t yisi::read_conll09(string parse, sent_t* sent) { + srlgraph_t result(sent); if (parse.empty()) { - auto tokens = tokenize(parse); - srlgraph_t re(tokens); - return re; + return result; } - - srlgraph_t result; - result.new_root(); + // cerr << result << endl; + //result.new_root(); srlgraph_t::label_type plabel = "V"; + vector tokens; vector preds; - map pids; - vector > > args; - map > child; + vector p_nids; + vector > labels; + map > child; istringstream iss(parse); + int n_space = 0; while (!iss.eof()) { string t; getline(iss, t); vector field = tokenize(t, '\t', true); //ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APREDs - int id = stoi(field[0]) - 1; - tokens.push_back(field[1]); - int parent = stoi(field[8]); - if (parent > 0) { - if (child.find(parent - 1) != child.end()) { - child[parent - 1].push_back(id); - } else { - child[parent - 1] = vector(1, id); + int id = stoi(field[0]) - 1 -n_space; + //cerr << "Reading " << id; + if (field[1] != ""){ + tokens.push_back(field[1]); + int p = stoi(field[8]) - n_space; + if (p > 0) { + child[p - 1].insert(id); } - } - if (field[13] != "_") { - preds.push_back(id); - srlgraph_t::span_type s(id, id + 1); - srlgraph_t::srlnid_type pid = result.new_pred(s, plabel); - pids[id] = pid; - } - for (int i = 14; i < (int)field.size(); i++) { - if ((int)args.size() < i - 13) { - vector > a; - args.push_back(a); + + for (int i = 14; i < (int)field.size(); i++) { + if (tokens.size() == 1){ + vector l; + l.push_back(field[i]); + labels.push_back(l); + } else { + labels[i-14].push_back(field[i]); + } + } + if (field[13] != "_") { + preds.push_back(id); + srlgraph_t::span_type s(id, id + 1); + srlgraph_t::srlnid_type pid = result.new_pred(s, plabel); + p_nids.push_back(pid); + 
labels[preds.size()-1][id]="V"; } - if (field[i] != "_") { - args[i - 14].push_back(make_pair(id, field[i])); + //cerr << " Done." << endl; + } else { + n_space++; + if (field[13] != "_") { + preds.push_back(-1); + p_nids.push_back(10000); } } } // while (!iss.eof()) - result.set_tokens(tokens); - for (int i = 0; i < (int)args.size(); i++) { - srlgraph_t::srlnid_type pid = pids[preds[i]]; - for (int j = 0; j < (int)args[i].size(); j++) { - int head = args[i][j].first; - srlgraph_t::label_type label = args[i][j].second; - size_t b = head; - size_t e = head; - resolve_arg_span(child, head, preds[i], b, e); - srlgraph_t::span_type s(b, e + 1); - result.new_arg(pid, s, label); + if (result.get_sent_type() == "word") { + if (tokens.size() > result.get_sent_length()) { + if (result.get_sent_length() > 0) + cerr << "Set tokens rule fired (" << tokens.size() << "," + << result.get_sent_length() << ")" << endl; + result.set_tokens(tokens); + } + } else { + if (result.get_sent_length() > 0 && tokens.size() > result.get_sent_length()) { + cerr << "ERROR: Tokenization of words changed by srl. Potential index failure!" 
<< endl; + cerr << "Tokens were: " << join(result.get_sentence(), " ") << endl; + cerr << "Tokens are: " << join(tokens, " ") << endl; } } - return result; -} // read_conll09 - -void yisi::resolve_arg_span(map > child, int curid, - srlgraph_t::srlnid_type pid, size_t& b, size_t&e) { - //cerr << curid << "," << pid << "," << b << "," << e << endl; - auto curchild = child[curid]; - bool find = false; - for (auto it = curchild.begin(); it != curchild.end() && !find; it++) { - if (*it == (int)pid) { - find = true; + for (int i = 0; i < (int)labels.size(); i++) { + for (int j = 0; j < (int) labels[i].size(); j++){ + populate_label(labels[i], child, j); } } - if (!find) { - for (auto it = curchild.begin(); it != curchild.end(); it++) { - if (*it < (int)b) { - b = *it; + for (int i = 0; i < (int)labels.size(); i++) { + auto pid = p_nids[i]; + if (pid != 10000) { + srlgraph_t::span_type curspan; + srlgraph_t::label_type curlabel = "_"; + for (size_t j = 0; j < labels[i].size(); j++) { + //cerr << labels[i][j] << " "; + if (labels[i][j] != curlabel) { + if (curlabel != "_" && curlabel != "V") { + curspan.second = j; + result.new_arg(pid, curspan, curlabel); + } + curspan.first = j; + curlabel = labels[i][j]; + } } - if (*it > (int)e) { - e = *it; + if (curlabel != "_" && curlabel != "V") { + curspan.second = labels[i].size(); + result.new_arg(pid, curspan, curlabel); + } + //cerr << endl; + } + } + return result; +} // read_conll09 + +srlgraph_t yisi::read_conll09(string parse) { + sent_t* sent = new sent_t("word"); + + auto result = read_conll09(parse, sent); + + return result; +} // read_conll09 + +void yisi::populate_label(vector& labels, map > child, int i) { + if (labels[i] != "_" && labels[i] != "V") { + auto curchildren = child[i]; + for (auto ct = curchildren.begin(); ct != curchildren.end(); ct++) { + //cerr << "Label " << *ct << " " << labels[*ct] << endl; + if (labels[*ct] == "_") { + labels[*ct] = labels[i]; + populate_label(labels, child, *ct); } - 
resolve_arg_span(child, *it, pid, b, e); } } } @@ -211,13 +248,40 @@ vector yisi::read_conll09batch(string filename) { } string parse; - + int i=0; while (!fin.eof()) { string line; getline(fin, line); if (line.empty()) { result.push_back(read_conll09(yisi::strip(parse))); parse = ""; + i++; + } else { + parse += line + "\n"; + } + fin.peek(); + } + return result; +} + +vector yisi::read_conll09batch(string filename, vector sents) { + vector result; + + ifstream fin(filename.c_str()); + if (!fin) { + cerr << "ERROR: Failed to open conll09 parse file (" << filename << "). Exiting..." << endl; + exit(1); + } + + string parse; + int i=0; + while (!fin.eof()) { + string line; + getline(fin, line); + if (line.empty()) { + result.push_back(read_conll09(yisi::strip(parse), sents[i])); + parse = ""; + i++; } else { parse += line + "\n"; } @@ -228,21 +292,19 @@ vector yisi::read_conll09batch(string filename) { srlread_t::srlread_t(string parsefile):parsefile_m(parsefile) {} -vector srlread_t::parse(vector sents) { - return yisi::read_conll09batch(parsefile_m); +vector srlread_t::parse(vector sents) { + return yisi::read_conll09batch(parsefile_m, sents); } -vector srltok_t::parse(vector sents) { +vector srltok_t::parse(vector sents) { vector result; for (auto it = sents.begin(); it != sents.end(); it++) { - auto tokens = yisi::tokenize(*it); - result.push_back(srlgraph_t(tokens)); + result.push_back(srlgraph_t(*it)); } return result; } -srlgraph_t srltok_t::parse(string sent) { - auto tokens = yisi::tokenize(sent); - auto result = srlgraph_t(tokens); +srlgraph_t srltok_t::parse(sent_t* sent) { + auto result = srlgraph_t(sent); return result; } diff --git a/src/srlutil.h b/src/srlutil.h index ab40638..499f1b5 100644 --- a/src/srlutil.h +++ b/src/srlutil.h @@ -22,35 +22,36 @@ #include "srlgraph.h" +#include #include #include #include namespace yisi { - std::vector read_srl(std::vector sents, std::string parsefile); + std::vector read_srl(std::vector sents, std::string 
parsefile); + srlgraph_t read_conll09(std::string parse, sent_t* sent); srlgraph_t read_conll09(std::string parse); - void resolve_arg_span(std::map > child, int curid, - srlgraph_t::srlnid_type pid, size_t& b, size_t&e); + void populate_label(std::vector& labels, std::map > child, int i); std::vector read_conll09batch(std::string filename); - + std::vector read_conll09batch(std::string filename, std::vector sents); class srlmodel_t { public: srlmodel_t() {} virtual ~srlmodel_t() {} - virtual srlgraph_t parse(std::string) { + virtual srlgraph_t parse(sent_t* sent) { std::cerr << "ERROR: Semantic role labeler type does not support " << "individual sentence parsing. Exiting..." << std::endl; exit(1); } - virtual std::vector parse(std::vector)=0; + virtual std::vector parse(std::vector sent)=0; }; // srlmodel_t class srlread_t:public srlmodel_t { public: srlread_t() {} srlread_t(std::string parsefile); - virtual std::vector parse(std::vector sents); + virtual std::vector parse(std::vector sents); private: std::string parsefile_m; }; // class srlread_t @@ -58,8 +59,8 @@ namespace yisi { class srltok_t:public srlmodel_t { public: srltok_t() {} - virtual srlgraph_t parse(std::string sent); - virtual std::vector parse(std::vector sents); + virtual srlgraph_t parse(sent_t* sent); + virtual std::vector parse(std::vector sents); private: }; //class srltok_t diff --git a/src/srlutil_test.cpp b/src/srlutil_test.cpp index df45cf9..d7ba9d9 100644 --- a/src/srlutil_test.cpp +++ b/src/srlutil_test.cpp @@ -23,10 +23,14 @@ int main(const int argc, const char* argv[]) { auto s = read_conll09batch(argv[1]); - for (auto it=s.begin(); it!=s.end(); it++){ + for (auto it = s.begin(); it != s.end(); it++) { cout << *it; } + for (auto it = s.begin(); it != s.end(); it++) { + it->delete_sent(); + } + return 0; } diff --git a/src/util.cpp b/src/util.cpp index 892750e..59cf30f 100644 --- a/src/util.cpp +++ b/src/util.cpp @@ -29,13 +29,15 @@ using namespace std; vector yisi::tokenize(string 
sent, char d, bool keep_empty) { //cerr << "Tokenizing " << sent << " by " << d << endl; - vector result; - istringstream iss(sent); - while (!iss.eof()) { - string token; - getline(iss, token, d); - if (token != "" || keep_empty) { - result.push_back(token); + vector result; + if (sent != "") { + istringstream iss(sent); + while (!iss.eof()) { + string token; + getline(iss, token, d); + if (token != "" || keep_empty) { + result.push_back(token); + } } } //cerr << endl; @@ -52,18 +54,6 @@ string yisi::join(const vector tokens, const string d) { return result; } -vector > yisi::collect_ngram(int n, vector& tokens) { - vector > result; - for (int i = 0; i <= (int)tokens.size() - n; i++) { - vector ngram; - for (int j = i; j < i + n; j++) { - ngram.push_back(tokens[j]); - } - result.push_back(ngram); - } - return result; -} - vector yisi::read_file(string filename) { vector result; ifstream fin(filename.c_str()); diff --git a/src/util.h b/src/util.h index b3aba7f..cae1439 100644 --- a/src/util.h +++ b/src/util.h @@ -25,7 +25,17 @@ namespace yisi { std::vector tokenize(std::string sent, char d = ' ', bool keep_empty = false); std::string join(const std::vector tokens, const std::string d = " "); - std::vector > collect_ngram(int n, std::vector& tokens); + template std::vector > collect_ngram(int n, std::vector& tokens){ + std::vector > result; + for (int i = 0; i <= (int)tokens.size() - n; i++) { + std::vector ngram; + for (int j = i; j < i + n; j++) { + ngram.push_back(tokens[j]); + } + result.push_back(ngram); + } + return result; + } std::vector read_file(std::string filename); void open_ofstream(std::ofstream& fout, std::string filename); std::string lowercase(std::string token); diff --git a/src/yisi.cpp b/src/yisi.cpp index 37cb1f4..a3523f4 100644 --- a/src/yisi.cpp +++ b/src/yisi.cpp @@ -24,28 +24,54 @@ using namespace std; using namespace yisi; struct eval_options { + std::string ref_type_m; + std::string hyp_type_m; + std::string inp_type_m; + std::string 
ref_file_m; std::string hyp_file_m; std::string inp_file_m; + std::string inpunit_file_m; + std::string refunit_file_m; + std::string hypunit_file_m; + std::string inpidemb_file_m; + std::string refidemb_file_m; + std::string hypidemb_file_m; + std::string sntscore_file_m; std::string docscore_file_m; + std::string mode_m; void init(com::masaers::cmdlp::parser& p) { using namespace com::masaers::cmdlp; - + p.add(make_knob(ref_type_m)) + .fallback("word") + .desc("Type of reference sentences. [word(default)|unit|uemb]") + .name("ref-type") + ; + p.add(make_knob(hyp_type_m)) + .fallback("word") + .desc("Type of hypothesis sentences. [word(default)|unit|uemb]") + .name("hyp-type") + ; + p.add(make_knob(inp_type_m)) + .fallback("word") + .desc("Type of input sentences. [word(default)|unit|uemb]") + .name("inp-type") + ; p.add(make_knob(ref_file_m)) .fallback("") - .desc("Filenames of references separated by ':'") + .desc("Filenames of references separated by ':'. (in surface word form for SRL.)") .name("ref-file") ; p.add(make_knob(hyp_file_m)) - .desc("Filename of hypotheses") + .desc("Filename of hypotheses. (in surface word form for SRL.)") .name("hyp-file") ; p.add(make_knob(inp_file_m)) .fallback("") - .desc("Filename of input") + .desc("Filename of input. 
(in surface word form for SRL.)") .name("inp-file") ; p.add(make_knob(sntscore_file_m)) @@ -58,6 +84,39 @@ struct eval_options { .desc("Filename of document score output (default: .doc") .name("docscore-file") ; + p.add(make_knob(inpunit_file_m)) + .fallback("") + .desc("Filename of input segmented into subword units.") + .name("inpunit-file") + ; + p.add(make_knob(hypunit_file_m)) + .fallback("") + .desc("Filename of hypotheses segmented into subword units.") + .name("hypunit-file") + ; + p.add(make_knob(refunit_file_m)) + .fallback("") + .desc("Filenames of references segmented into subword units, separated by ':'.") + .name("refunit-file") + ; + p.add(make_knob(inpidemb_file_m)) + .fallback("") + .desc("Filename of input subword units with contextual embeddings: one unit per line, " + "empty line separates sentences [unitidtokenidspace_sep_emb].") + .name("inpidemb-file") + ; + p.add(make_knob(hypidemb_file_m)) + .fallback("") + .desc("Filename of hypotheses subword units with contextual embeddings: one unit per line, " + "empty line separates sentences [unitidtokenidspace_sep_emb].") + .name("hypidemb-file") + ; + p.add(make_knob(refidemb_file_m)) + .fallback("") + .desc("Filenames of reference subword units with contextual embeddings, separated by ':': one " + "unit per line, empty line separates sentences [unitidtokenidspace_sep_emb].") + .name("refidemb-file") + ; p.add(make_knob(mode_m)) .fallback("yisi") .desc("Output mode of YiSi [yisi(default): print score only " @@ -71,19 +130,31 @@ int main(const int argc, const char* argv[]) { typedef com::masaers::cmdlp::options options_type; - options_type opt(argc,argv); - if (! 
opt) { + options_type opt(argc, argv); + if (!opt) { return opt.exit_code(); } if (opt.reflexweight_name_m == "learn" && opt.reflexweight_path_m == "") { - opt.reflexweight_path_m = opt.ref_file_m; + if (opt.ref_type_m == "word") { + opt.reflexweight_path_m = opt.ref_file_m; + } else { + opt.reflexweight_path_m = opt.refunit_file_m; + } } if (opt.hyplexweight_name_m == "learn" && opt.hyplexweight_path_m == "") { - opt.hyplexweight_path_m = opt.hyp_file_m; + if (opt.hyp_type_m == "word") { + opt.hyplexweight_path_m = opt.hyp_file_m; + } else { + opt.hyplexweight_path_m = opt.hypunit_file_m; + } } if (opt.inplexweight_name_m == "learn" && opt.inplexweight_path_m == "") { - opt.inplexweight_path_m = opt.inp_file_m; + if (opt.inp_type_m == "word") { + opt.inplexweight_path_m = opt.inp_file_m; + } else { + opt.inplexweight_path_m = opt.inpunit_file_m; + } } yisiscorer_t yisi(opt); @@ -99,65 +170,77 @@ int main(const int argc, const char* argv[]) ofstream SNTOUT; open_ofstream(SNTOUT, opt.sntscore_file_m); - vector hypsents = read_file(opt.hyp_file_m); + cerr << "Reading hyp sents... "; + vector hypsents = read_sent(opt.hyp_type_m, opt.hyp_file_m, opt.hypunit_file_m, opt.hypidemb_file_m); + cerr << "Done." << endl; - vector > refsents; + vector < vector > refsents; if (opt.ref_file_m != "") { + cerr << "Reading ref sents... 
"; auto reffiles = tokenize(opt.ref_file_m, ':'); - auto it = reffiles.begin(); - vector < string > rs = read_file(*it); + auto refunits = tokenize(opt.refunit_file_m, ':'); + auto refidemb = tokenize(opt.refidemb_file_m, ':'); + size_t i = 0; + vector rs; + if (reffiles.size() == refunits.size()) { + rs = read_sent(opt.ref_type_m, reffiles[i], refunits[i], refidemb[i]); + } else { + rs = read_sent(opt.ref_type_m, reffiles[i]); + } if (rs.size() == hypsents.size()) { for (auto jt = rs.begin(); jt != rs.end(); jt++) { - vector < string > ref; + vector ref; ref.push_back(*jt); refsents.push_back(ref); } - it++; - for (; it != reffiles.end(); it++) { - rs = read_file(*it); + i++; + for (; i < reffiles.size(); i++) { + rs = read_sent(opt.ref_type_m, reffiles[i], refunits[i], refidemb[i]); if (rs.size() == hypsents.size()) { for (size_t j = 0; j < rs.size(); j++) { refsents[j].push_back(rs[j]); } } else { cerr << "ERROR: No. of sentences in ref-file (" << rs.size() - << ") does not match with no. of sentences in hyp-file (" - << hypsents.size() << "). Check your input! Exiting ..." - << endl; + << ") does not match with no. of sentences in hyp-file (" + << hypsents.size() << "). Check your input! Exiting ..." + << endl; exit(1); } } } else { cerr << "ERROR: No. of sentences in ref-file (" << rs.size() - << ") does not match with no. of sentences in hyp-file (" - << hypsents.size() << "). Check your input! Exiting ..." << endl; + << ") does not match with no. of sentences in hyp-file (" + << hypsents.size() << "). Check your input! Exiting ..." << endl; exit(1); } + cerr << "Done." << endl; } - vector inpsents; + vector inpsents; if (opt.inp_file_m != "") { - inpsents = read_file(opt.inp_file_m); - } - - if (inpsents.size() > 0 && inpsents.size() != hypsents.size()) { - cerr << "ERROR: No. of sentences in inp-file (" << inpsents.size() - << ") does not match with no. of sentences in hyp-file (" - << hypsents.size() << "). Check your input! Exiting..." 
<< endl; - exit(1); + cerr << "Reading inp sents... "; + inpsents = read_sent(opt.inp_type_m, opt.inp_file_m, opt.inpunit_file_m, opt.inpidemb_file_m); + if (inpsents.size() != hypsents.size()) { + cerr << "ERROR: No. of sentences in inp-file (" << inpsents.size() + << ") does not match with no. of sentences in hyp-file (" + << hypsents.size() << "). Check your input! Exiting..." << endl; + exit(1); + } + cerr << "Done." << endl; } - cerr << "Tokenizing/SRL-ing hyp ... "; + cerr << "Creating hyp srlgraphs... "; vector hypsrlgraphs = yisi.hypsrlparse(hypsents); cerr << "Done." << endl; - vector > refsrlgraphs; + vector < vector > refsrlgraphs; for (size_t i = 0; i < hypsrlgraphs.size(); i++) { refsrlgraphs.push_back(vector()); } if (refsents.size() > 0) { - cerr << "Tokenizing/SRL-ing ref ... "; + cerr << "Creating ref srlgraphs... "; for (size_t i = 0; i < hypsrlgraphs.size(); i++) { refsrlgraphs[i] = yisi.refsrlparse(refsents[i]); } @@ -166,7 +249,7 @@ int main(const int argc, const char* argv[]) vector inpsrlgraphs; if (inpsents.size() > 0) { - cerr << "Tokenizing/SRL-ing inp ... "; + cerr << "Creating inp srlgraphs... "; inpsrlgraphs = yisi.inpsrlparse(inpsents); cerr << "Done." 
<< endl; } @@ -177,9 +260,19 @@ int main(const int argc, const char* argv[]) cout << "Evaluating line " << i + 1 << endl; yisigraph_t m; if (opt.inp_file_m != "") { + /* + cerr<<"inpsrlgraph:"<begin(); jt != it->end(); jt++) { + delete *jt; + *jt = NULL; + } + } + for (auto it = inpsents.begin(); it != inpsents.end(); it++) { + delete *it; + *it = NULL; + } + return 0; } diff --git a/src/yisigraph.cpp b/src/yisigraph.cpp index cc05785..fac9768 100644 --- a/src/yisigraph.cpp +++ b/src/yisigraph.cpp @@ -22,7 +22,8 @@ using namespace yisi; using namespace std; -yisigraph_t::yisigraph_t(const vector refsrlgraph, const srlgraph_t hypsrlgraph) { +yisigraph_t::yisigraph_t(const vector refsrlgraph, + const srlgraph_t hypsrlgraph) { refsrlgraph_m = refsrlgraph; hypsrlgraph_m = hypsrlgraph; inp_b = false; @@ -32,7 +33,7 @@ yisigraph_t::yisigraph_t(const vector refsrlgraph, const srlgraph_t //cout << refsrlgraph_m; //cout << "hypsrlgraph:" << endl; //cout << hypsrlgraph_m; - //cout<<"Done."< refsrlgraph, @@ -82,6 +83,7 @@ size_t yisigraph_t::get_refsize() { return refsrlgraph_m.size(); } +/* double yisigraph_t::get_sentlength(int mode, int refid) { switch (mode) { case yisi::INP_MODE: @@ -104,10 +106,11 @@ double yisigraph_t::get_sentlength(int mode, int refid) { } break; default: - cerr << "ERROR: Unknown mode in sent length. Contact Jackie. Exiting..." << endl; + cerr << "ERROR: Unknown mode in sent length. Contact Jackie. Exiting..." 
<< endl; exit(1); } } +*/ double yisigraph_t::get_sentsim(int mode, int refid) { double result = 0.0; @@ -203,6 +206,7 @@ vector yisigraph_t::get_args(srlnid_type roleid, int m } } +/* vector& yisigraph_t::get_sentence(int mode, int refid) { switch (mode) { case yisi::INP_MODE: @@ -231,12 +235,13 @@ vector& yisigraph_t::get_sentence(int mode, int refid) { exit(1); } } +*/ -vector yisigraph_t::get_role_fillers(srlnid_type roleid, int mode, int refid) { +vector yisigraph_t::get_role_filler_units(srlnid_type roleid, int mode, int refid) { switch (mode) { case yisi::INP_MODE: if (inp_b) { - return inpsrlgraph_m.get_role_fillers(roleid); + return inpsrlgraph_m.get_role_filler_units(roleid); } else { cerr << "ERROR: YiSi graph with no input sentence. " << "Failed to get input role fillers. Exiting..." << endl; @@ -244,11 +249,11 @@ vector yisigraph_t::get_role_fillers(srlnid_type roleid, int mode, int r } break; case yisi::HYP_MODE: - return hypsrlgraph_m.get_role_fillers(roleid); + return hypsrlgraph_m.get_role_filler_units(roleid); break; case yisi::REF_MODE: if (-1 < refid && refid < (int)refsrlgraph_m.size()) { - return refsrlgraph_m[refid].get_role_fillers(roleid); + return refsrlgraph_m[refid].get_role_filler_units(roleid); } else { cerr << "ERROR: refid (" << refid << ") out of range [0," << refsrlgraph_m.size() << "]. Failed to get reference role fillers. Exiting..." 
<< endl; @@ -437,29 +442,29 @@ double yisigraph_t::spanlength(span_type span) { } void yisigraph_t::print(ostream& os) { - string h = yisi::join(hypsrlgraph_m.get_sentence(), " "); + string h = yisi::join(hypsrlgraph_m.get_role_filler_units(hypsrlgraph_m.get_root()), " "); //os << h <first; auto hypnid = (jt->second).first; double sim = (jt->second).second; - r = yisi::join(refsrlgraph_m[i].get_role_fillers(refnid), " "); - h = yisi::join(hypsrlgraph_m.get_role_fillers(hypnid), " "); + r = yisi::join(refsrlgraph_m[i].get_role_filler_units(refnid), " "); + h = yisi::join(hypsrlgraph_m.get_role_filler_units(hypnid), " "); os << r << "\t" << h << "\t" << sim << endl; } } if (inp_b) { - string inp = yisi::join(inpsrlgraph_m.get_sentence(), " "); + string inp = yisi::join(inpsrlgraph_m.get_role_filler_units(inpsrlgraph_m.get_root()), " "); os << inp << endl; for (auto kt = inpalignment_m.begin(); kt != inpalignment_m.end(); kt++) { auto inpnid = kt->first; auto hypnid = (kt->second).first; double sim = (kt->second).second; - inp = yisi::join(inpsrlgraph_m.get_role_fillers(inpnid), " "); - h = yisi::join(hypsrlgraph_m.get_role_fillers(hypnid), " "); + inp = yisi::join(inpsrlgraph_m.get_role_filler_units(inpnid), " "); + h = yisi::join(hypsrlgraph_m.get_role_filler_units(hypnid), " "); os << inp << "\t" << h << "\t" << sim << endl; } } diff --git a/src/yisigraph.h b/src/yisigraph.h index 07f37a3..69a8211 100644 --- a/src/yisigraph.h +++ b/src/yisigraph.h @@ -29,7 +29,7 @@ namespace yisi { - class yisigraph_t{ + class yisigraph_t { public: typedef srlgraph_t::span_type span_type; typedef srlgraph_t::label_type label_type; @@ -49,12 +49,12 @@ namespace yisi { bool withinp(); size_t get_refsize(); - double get_sentlength(int mode, int refid=-1); + // double get_sentlength(int mode, int refid=-1); double get_sentsim(int mode, int refid=-1); std::vector get_preds(int mode, int refid=-1); std::vector get_args(srlnid_type roleid, int mode, int refid=-1); - std::vector& 
get_sentence(int mode, int refid=-1); - std::vector get_role_fillers(srlnid_type roleid, int mode, int refid=-1); + // std::vector& get_sentence(int mode, int refid=-1); + std::vector get_role_filler_units(srlnid_type roleid, int mode, int refid=-1); double get_rolespanlength(srlnid_type roleid, int mode, int refid=-1); label_type get_rolelabel(srlnid_type roleid, int mode, int refid=-1); std::vector > get_hypalignment(srlnid_type roleid); @@ -73,23 +73,33 @@ namespace yisi { std::map > > hypalignment_m; std::map inpalignment_m; bool inp_b; - }; // class yisigraph_t - + }; // class yisigraph_t + template void yisigraph_t::align(phrasesim_t* phrasesim) { //yisi alignment algorithm goes here //loop all references and input for (size_t refid = 0; refid < refsrlgraph_m.size(); refid++) { - //std::cerr << "first align the sentence node" << std::endl; - auto r = refsrlgraph_m[refid].get_sentence(); - //std::cerr << "Got r " << r.size() << std::endl; - auto h = hypsrlgraph_m.get_sentence(); - //std::cerr << "Got h " << h.size() << std::endl; - std::pair sentsim = (*phrasesim)(r, h, yisi::REF_MODE); - //std::cerr << "sentsim = (" << sentsim.first << "," << sentsim.second << ")"; + //std::cerr << "first align the sentence node of ref" << refid << std::endl; auto refroot = refsrlgraph_m[refid].get_root(); - //std::cerr << "refroot = " << refroot << std::endl; auto hyproot = hypsrlgraph_m.get_root(); + + auto ru = refsrlgraph_m[refid].get_role_filler_units(refroot); + //std::cerr << "Got r " << ru.size() << std::endl; + auto hu = hypsrlgraph_m.get_role_filler_units(hyproot); + //std::cerr << "Got h " << hu.size() << std::endl; + std::pair sentsim; + if (refsrlgraph_m[refid].get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + //std::cerr<<"computing sentsim on word"< predsim = - (*phrasesim)(refpredphrase, hyppredphrase, yisi::REF_MODE); + auto hyppredphrase = hypsrlgraph_m.get_role_filler_units(hyppredid); + std::pair predsim; + if 
(refsrlgraph_m[refid].get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + predsim = (*phrasesim)(refpredphrase, hyppredphrase, yisi::REF_MODE); + } else { + auto rpredemb = refsrlgraph_m[refid].get_role_filler_embs(refpredid); + auto hpredemb = hypsrlgraph_m.get_role_filler_embs(hyppredid); + predsim = (*phrasesim)(refpredphrase, hyppredphrase, rpredemb, hpredemb, yisi::REF_MODE); + } refpredmatch.add_weight(refpredid, hyppredid, predsim.second); hyppredmatch.add_weight(refpredid, hyppredid, predsim.first); } @@ -139,12 +155,18 @@ namespace yisi { maxmatching_t argmatch; for (auto it = refargs.begin(); it != refargs.end(); it++) { auto refargid = *it; - auto refargphrase = refsrlgraph_m[refid].get_role_fillers(refargid); + auto refargphrase = refsrlgraph_m[refid].get_role_filler_units(refargid); for (auto jt = hypargs.begin(); jt != hypargs.end(); jt++) { auto hypargid = *jt; - auto hypargphrase = hypsrlgraph_m.get_role_fillers(hypargid); - std::pair argsim = - (*phrasesim)(refargphrase, hypargphrase, yisi::REF_MODE); + auto hypargphrase = hypsrlgraph_m.get_role_filler_units(hypargid); + std::pair argsim; + if (refsrlgraph_m[refid].get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + argsim = (*phrasesim)(refargphrase, hypargphrase, yisi::REF_MODE); + } else { + auto rargemb = refsrlgraph_m[refid].get_role_filler_embs(refargid); + auto hargemb = hypsrlgraph_m.get_role_filler_embs(hypargid); + argsim = (*phrasesim)(refargphrase, hypargphrase, rargemb, hargemb, yisi::REF_MODE); + } argmatch.add_weight(refargid, hypargid, argsim.second); } // for jt } // for it @@ -164,22 +186,27 @@ namespace yisi { auto aligned_hyp_pred = hpr[i].first.second; auto psim = hpr[i].second; if (hypalignment_m.find(aligned_hyp_pred) == hypalignment_m.end()) { - hypalignment_m[aligned_hyp_pred] = - std::vector >(); + hypalignment_m[aligned_hyp_pred] = std::vector >(); } hypalignment_m[aligned_hyp_pred].push_back(std::make_pair(refid, - 
alignment_type(aligned_ref_pred, psim))); + alignment_type(aligned_ref_pred, psim))); auto refargs = refsrlgraph_m[refid].get_args(aligned_ref_pred); auto hypargs = hypsrlgraph_m.get_args(aligned_hyp_pred); maxmatching_t argmatch; for (auto it = refargs.begin(); it != refargs.end(); it++) { auto refargid = *it; - auto refargphrase = refsrlgraph_m[refid].get_role_fillers(refargid); + auto refargphrase = refsrlgraph_m[refid].get_role_filler_units(refargid); for (auto jt = hypargs.begin(); jt != hypargs.end(); jt++) { auto hypargid = *jt; - auto hypargphrase = hypsrlgraph_m.get_role_fillers(hypargid); - std::pair argsim = - (*phrasesim)(refargphrase, hypargphrase, yisi::REF_MODE); + auto hypargphrase = hypsrlgraph_m.get_role_filler_units(hypargid); + std::pair argsim; + if (refsrlgraph_m[refid].get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + argsim = (*phrasesim)(refargphrase, hypargphrase, yisi::REF_MODE); + } else { + auto rargemb = refsrlgraph_m[refid].get_role_filler_embs(refargid); + auto hargemb = hypsrlgraph_m.get_role_filler_embs(hypargid); + argsim = (*phrasesim)(refargphrase, hypargphrase, rargemb, hargemb, yisi::REF_MODE); + } argmatch.add_weight(refargid, hypargid, argsim.first); } // for jt } // for it @@ -190,26 +217,38 @@ namespace yisi { auto asim = ar[j].second; if (hypalignment_m.find(aligned_hyp_arg) == hypalignment_m.end()) { hypalignment_m[aligned_hyp_arg] = - std::vector >(); + std::vector >(); } hypalignment_m[aligned_hyp_arg].push_back(std::make_pair(refid, - alignment_type(aligned_ref_arg, asim))); + alignment_type(aligned_ref_arg, asim))); } // for j } // for i } // for refid //input if (inp_b) { - auto r = inpsrlgraph_m.get_sentence(); - auto h = hypsrlgraph_m.get_sentence(); - std::pair sentsim = (*phrasesim)(r, h, yisi::INP_MODE); + //std::cerr << "first align the sentence node of inp: "; auto inproot = inpsrlgraph_m.get_root(); auto hyproot = hypsrlgraph_m.get_root(); + auto r = 
inpsrlgraph_m.get_role_filler_units(inproot); + //std::cerr<< r.size(); + auto h = hypsrlgraph_m.get_role_filler_units(hyproot); + //std::cerr<< h.size(); + std::pair sentsim; + if (inpsrlgraph_m.get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + sentsim = (*phrasesim)(r, h, yisi::INP_MODE); + } else { + auto remb = inpsrlgraph_m.get_role_filler_embs(inproot); + auto hemb = hypsrlgraph_m.get_role_filler_embs(hyproot); + //std::cerr<< remb.size() <<" " < >(); } hypalignment_m[hyproot].push_back(std::make_pair((int)refsrlgraph_m.size(), - alignment_type(inproot, sentsim.first))); + alignment_type(inproot, sentsim.first))); auto inppreds = inpsrlgraph_m.get_preds(); auto hyppreds = hypsrlgraph_m.get_preds(); maxmatching_t inppredmatch; @@ -218,14 +257,20 @@ namespace yisi { auto inppredid = *it; auto inppredspan = inpsrlgraph_m.get_role_span(inppredid); if (inppredspan.first != inppredspan.second) { - auto inppredphrase = inpsrlgraph_m.get_role_fillers(inppredid); + auto inppredphrase = inpsrlgraph_m.get_role_filler_units(inppredid); for (auto jt = hyppreds.begin(); jt != hyppreds.end(); jt++) { auto hyppredid = *jt; auto hyppredspan = hypsrlgraph_m.get_role_span(hyppredid); if (hyppredspan.first != hyppredspan.second) { - auto hyppredphrase = hypsrlgraph_m.get_role_fillers(hyppredid); - std::pair predsim = - (*phrasesim)(inppredphrase, hyppredphrase, yisi::INP_MODE); + auto hyppredphrase = hypsrlgraph_m.get_role_filler_units(hyppredid); + std::pair predsim; + if (inpsrlgraph_m.get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + predsim = (*phrasesim)(inppredphrase, hyppredphrase, yisi::INP_MODE); + } else { + auto ipredemb = inpsrlgraph_m.get_role_filler_embs(inppredid); + auto hpredemb = hypsrlgraph_m.get_role_filler_embs(hyppredid); + predsim = (*phrasesim)(inppredphrase, hyppredphrase, ipredemb, hpredemb, yisi::INP_MODE); + } inppredmatch.add_weight(inppredid, hyppredid, predsim.second); 
hyppredmatch.add_weight(inppredid, hyppredid, predsim.first); } @@ -244,12 +289,18 @@ namespace yisi { maxmatching_t argmatch; for (auto it = inpargs.begin(); it != inpargs.end(); it++) { auto inpargid = *it; - auto inpargphrase = inpsrlgraph_m.get_role_fillers(inpargid); + auto inpargphrase = inpsrlgraph_m.get_role_filler_units(inpargid); for (auto jt = hypargs.begin(); jt != hypargs.end(); jt++) { auto hypargid = *jt; - auto hypargphrase = hypsrlgraph_m.get_role_fillers(hypargid); - std::pair argsim = - (*phrasesim)(inpargphrase, hypargphrase, yisi::INP_MODE); + auto hypargphrase = hypsrlgraph_m.get_role_filler_units(hypargid); + std::pair argsim; + if (inpsrlgraph_m.get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + argsim = (*phrasesim)(inpargphrase, hypargphrase, yisi::INP_MODE); + } else { + auto iargemb = inpsrlgraph_m.get_role_filler_embs(inpargid); + auto hargemb = hypsrlgraph_m.get_role_filler_embs(hypargid); + argsim = (*phrasesim)(inpargphrase, hypargphrase, iargemb, hargemb, yisi::INP_MODE); + } argmatch.add_weight(inpargid, hypargid, argsim.second); } } @@ -267,21 +318,27 @@ namespace yisi { auto psim = hpr[i].second; if (hypalignment_m.find(aligned_hyp_pred) == hypalignment_m.end()) { hypalignment_m[aligned_hyp_pred] = - std::vector >(); + std::vector >(); } hypalignment_m[aligned_hyp_pred].push_back(std::make_pair((int)refsrlgraph_m.size(), - alignment_type(aligned_inp_pred, psim))); + alignment_type(aligned_inp_pred, psim))); auto inpargs = inpsrlgraph_m.get_args(aligned_inp_pred); auto hypargs = hypsrlgraph_m.get_args(aligned_hyp_pred); maxmatching_t argmatch; for (auto it = inpargs.begin(); it != inpargs.end(); it++) { auto inpargid = *it; - auto inpargphrase = inpsrlgraph_m.get_role_fillers(inpargid); + auto inpargphrase = inpsrlgraph_m.get_role_filler_units(inpargid); for (auto jt = hypargs.begin(); jt != hypargs.end(); jt++) { auto hypargid = *jt; - auto hypargphrase = hypsrlgraph_m.get_role_fillers(hypargid); - 
std::pair argsim = - (*phrasesim)(inpargphrase, hypargphrase, yisi::INP_MODE); + auto hypargphrase = hypsrlgraph_m.get_role_filler_units(hypargid); + std::pair argsim; + if (inpsrlgraph_m.get_sent_type() != "uemb" || hypsrlgraph_m.get_sent_type() != "uemb") { + argsim = (*phrasesim)(inpargphrase, hypargphrase, yisi::INP_MODE); + } else { + auto iargemb = inpsrlgraph_m.get_role_filler_embs(inpargid); + auto hargemb = hypsrlgraph_m.get_role_filler_embs(hypargid); + argsim = (*phrasesim)(inpargphrase, hypargphrase, iargemb, hargemb, yisi::INP_MODE); + } argmatch.add_weight(inpargid, hypargid, argsim.first); } } @@ -292,17 +349,17 @@ namespace yisi { auto asim = ar[j].second; if (hypalignment_m.find(aligned_hyp_arg) == hypalignment_m.end()) { hypalignment_m[aligned_hyp_arg] = - std::vector >(); + std::vector >(); } hypalignment_m[aligned_hyp_arg].push_back(std::make_pair((int)refsrlgraph_m.size(), - alignment_type(aligned_inp_arg, asim))); + alignment_type(aligned_inp_arg, asim))); } } } } // align std::ostream& operator<<(std::ostream& os, const yisi::yisigraph_t& m); - + } // yisi diff --git a/src/yisiscorer.h b/src/yisiscorer.h index 94d4d2d..9b97e11 100644 --- a/src/yisiscorer.h +++ b/src/yisiscorer.h @@ -30,674 +30,676 @@ namespace yisi { - struct yisi_options { - std::string inpsrl_name_m; - std::string inpsrl_path_m; - std::string refsrl_name_m; - std::string refsrl_path_m; - std::string hypsrl_name_m; - std::string hypsrl_path_m; - std::string labelconfig_path_m; - std::string weightconfig_path_m; - std::string frameweight_name_m; - - double alpha_m; - double beta_m; - - void init(com::masaers::cmdlp::parser& p) { - using namespace com::masaers::cmdlp; - - p.add(make_knob(inpsrl_name_m)) - .fallback("") - .desc("Type of input language SRL: [read|mate]") - .name("inpsrl-type") - ; - p.add(make_knob(inpsrl_path_m)) - .fallback("") - .desc("[read: path to assert formated parse of input sentences " - "| mate: full path and filename of .mplsconfig]") - 
.name("inpsrl-path") - ; - p.add(make_knob(hypsrl_name_m)) - .fallback("") - .desc("Type of output language SRL: [read|mate]") - .name("outsrl-type") - .name("hypsrl-type") - .name("srl-type") - ; - p.add(make_knob(hypsrl_path_m)) - .fallback("") - .desc("[read: path to assert formatted parse output " - "| mate: full path and filename of .mplsconfig]") - .name("outsrl-path") - .name("hypsrl-path") - .name("srl-path") - ; - p.add(make_knob(refsrl_name_m)) - .fallback("") - .desc("Type of reference SRL (specify only if it is different from the hypothesis SRL): [read|mate]") - .name("refsrl-type") - ; - p.add(make_knob(refsrl_path_m)) - .fallback("") - .desc("[read: path to assert formatted parse reference " - "| mate: full path and filename of .mplsconfig]") - .name("refsrl-path") - ; - p.add(make_knob(labelconfig_path_m)) - .fallback("") - .desc("Path to YiSi SRL role label config file") - .name("labelconfig-path") - ; - p.add(make_knob(weightconfig_path_m)) - .fallback("") - .desc("Path to YiSi SRL role label config file (default: " - " to use YiSi unsupervised estimation of weight") - .name("weightconfig-path") - ; - p.add(make_knob(frameweight_name_m)) - .fallback("coverage") - .desc("Type of frame weight function: [uniform|coverage(default)]") - .name("frameweight-type") - ; - p.add(make_knob(beta_m)) - .fallback(0.0) - .desc("Beta value of YiSi [0.0(default)]") - .name("beta") - ; - p.add(make_knob(alpha_m)) - .fallback(0.5) - .desc("Ratio of precision & recall in YiSi") - .name("alpha") - ; - } - }; // struct yisi_options - - template - class yisiscorer_t { - public: - typedef opt_T opt_type; - - yisiscorer_t() {} - - yisiscorer_t(opt_T opt) { - alpha_m = opt.alpha_m; - frameweight_name_m = opt.frameweight_name_m; - alpha_m = opt.alpha_m; - beta_m = opt.beta_m; - - int i = 0; - if (opt.labelconfig_path_m != "") { - std::cerr << "Reading labelconfig from " << opt.labelconfig_path_m << " ... 
"; - std::ifstream LBL(opt.labelconfig_path_m.c_str()); - if (!LBL) { - std::cerr << "ERROR: Failed to open labelconfig. Exiting..." << std::endl; - exit(1); - } - while (!LBL.eof()) { - std::string line; - getline(LBL, line); - if (line != "") { - std::istringstream iss(line); - while (!iss.eof()) { - std::string label; - iss >> label; - label_m[label] = i; - } - i++; - } - } - LBL.close(); - std::cerr << "Done." << std::endl; - } - - weightconfig_path_m = opt.weightconfig_path_m; - if (weightconfig_path_m != "" - && weightconfig_path_m != "lexweight" - && weightconfig_path_m != "uniform") { - std::cerr << "Reading weightconfig from " << opt.weightconfig_path_m << " ... "; - std::ifstream W(weightconfig_path_m.c_str()); - if (!W) { - std::cerr << "ERROR: Failed to open weightconfig. Exiting..." << std::endl; - exit(1); - } - while (!W.eof()) { - double w; - W >> w; - weight_m.push_back(w); - } - W.close(); - std::cerr << "Done." << std::endl; - if ((int)weight_m.size() != i) { - std::cerr << "ERROR: Number of weights in weightconfig does not match " - << "with number of lines in labelconfig. Exiting..." 
<< std::endl; - exit(1); - } - } else { - for (int j = 0; j < i; j++) { - weight_m.push_back(1.0); - } - } - - phrasesim_p = new phrasesim_t(opt); - hypsrl_p = new srl_t(opt.hypsrl_name_m, opt.hypsrl_path_m); - hypsrl_name_m = opt.hypsrl_name_m; - if (opt.refsrl_name_m != ""){ - refsrl_p = new srl_t(opt.refsrl_name_m, opt.refsrl_path_m); - } else { - refsrl_p = hypsrl_p;; + struct yisi_options { + std::string inpsrl_name_m; + std::string inpsrl_path_m; + std::string refsrl_name_m; + std::string refsrl_path_m; + std::string hypsrl_name_m; + std::string hypsrl_path_m; + + std::string labelconfig_path_m; + std::string weightconfig_path_m; + std::string frameweight_name_m; + + double alpha_m; + double beta_m; + + void init(com::masaers::cmdlp::parser& p) { + using namespace com::masaers::cmdlp; + + p.add(make_knob(inpsrl_name_m)) + .fallback("") + .desc("Type of input language SRL: [read|mate]") + .name("inpsrl-type") + ; + p.add(make_knob(inpsrl_path_m)) + .fallback("") + .desc("[read: path to assert formated parse of input sentences " + "| mate: full path and filename of .mplsconfig]") + .name("inpsrl-path") + ; + p.add(make_knob(hypsrl_name_m)) + .fallback("") + .desc("Type of output language SRL: [read|mate]") + .name("outsrl-type") + .name("hypsrl-type") + .name("srl-type") + ; + p.add(make_knob(hypsrl_path_m)) + .fallback("") + .desc("[read: path to assert formatted parse output " + "| mate: full path and filename of .mplsconfig]") + .name("outsrl-path") + .name("hypsrl-path") + .name("srl-path") + ; + p.add(make_knob(refsrl_name_m)) + .fallback("") + .desc("Type of reference SRL (specify only if it is different from " + "the hypothesis SRL): [read|mate]") + .name("refsrl-type") + ; + p.add(make_knob(refsrl_path_m)) + .fallback("") + .desc("[read: path to assert formatted parse reference " + "| mate: full path and filename of .mplsconfig]") + .name("refsrl-path") + ; + p.add(make_knob(labelconfig_path_m)) + .fallback("") + .desc("Path to YiSi SRL role label 
config file") + .name("labelconfig-path") + ; + p.add(make_knob(weightconfig_path_m)) + .fallback("") + .desc("Path to YiSi SRL role label config file (default: " + " to use YiSi unsupervised estimation of weight") + .name("weightconfig-path") + ; + p.add(make_knob(frameweight_name_m)) + .fallback("coverage") + .desc("Type of frame weight function: [uniform|coverage(default)]") + .name("frameweight-type") + ; + p.add(make_knob(beta_m)) + .fallback(0.0) + .desc("Beta value of YiSi [0.0(default)]") + .name("beta") + ; + p.add(make_knob(alpha_m)) + .fallback(0.5) + .desc("Ratio of precision & recall in YiSi") + .name("alpha") + ; } - refsrl_name_m = opt.refsrl_name_m; - inpsrl_p = new srl_t(opt.inpsrl_name_m, opt.inpsrl_path_m); - inpsrl_name_m = opt.inpsrl_name_m; - } // yisiscorer_t - - ~yisiscorer_t() { - if (phrasesim_p != NULL) { - delete phrasesim_p; - phrasesim_p = NULL; + }; // struct yisi_options + + template + class yisiscorer_t { + public: + typedef opt_T opt_type; + + yisiscorer_t() {} + + yisiscorer_t(opt_T opt) { + alpha_m = opt.alpha_m; + frameweight_name_m = opt.frameweight_name_m; + alpha_m = opt.alpha_m; + beta_m = opt.beta_m; + + int i = 0; + if (opt.labelconfig_path_m != "") { + std::cerr << "Reading labelconfig from " << opt.labelconfig_path_m << " ... "; + std::ifstream LBL(opt.labelconfig_path_m.c_str()); + if (!LBL) { + std::cerr << "ERROR: Failed to open labelconfig. Exiting..." << std::endl; + exit(1); + } + while (!LBL.eof()) { + std::string line; + getline(LBL, line); + if (line != "") { + std::istringstream iss(line); + while (!iss.eof()) { + std::string label; + iss >> label; + label_m[label] = i; + } + i++; + } + } + LBL.close(); + std::cerr << "Done." << std::endl; + } + + weightconfig_path_m = opt.weightconfig_path_m; + if (weightconfig_path_m != "" + && weightconfig_path_m != "lexweight" + && weightconfig_path_m != "uniform") { + std::cerr << "Reading weightconfig from " << opt.weightconfig_path_m << " ... 
"; + std::ifstream W(weightconfig_path_m.c_str()); + if (!W) { + std::cerr << "ERROR: Failed to open weightconfig. Exiting..." << std::endl; + exit(1); + } + while (!W.eof()) { + double w; + W >> w; + weight_m.push_back(w); + } + W.close(); + std::cerr << "Done." << std::endl; + if ((int)weight_m.size() != i) { + std::cerr << "ERROR: Number of weights in weightconfig does not match " + << "with number of lines in labelconfig. Exiting..." << std::endl; + exit(1); + } + } else { + for (int j = 0; j < i; j++) { + weight_m.push_back(1.0); + } + } + + phrasesim_p = new phrasesim_t(opt); + hypsrl_p = new srl_t(opt.hypsrl_name_m, opt.hypsrl_path_m); + hypsrl_name_m = opt.hypsrl_name_m; + if (opt.refsrl_name_m != "") { + refsrl_p = new srl_t(opt.refsrl_name_m, opt.refsrl_path_m); + } else { + refsrl_p = hypsrl_p;; + } + refsrl_name_m = opt.refsrl_name_m; + inpsrl_p = new srl_t(opt.inpsrl_name_m, opt.inpsrl_path_m); + inpsrl_name_m = opt.inpsrl_name_m; + } // yisiscorer_t + + ~yisiscorer_t() { + if (phrasesim_p != NULL) { + delete phrasesim_p; + phrasesim_p = NULL; + } + if (inpsrl_p != NULL) { + delete inpsrl_p; + inpsrl_p = NULL; + } + if (hypsrl_p != NULL) { + delete hypsrl_p; + hypsrl_p = NULL; + if (refsrl_name_m == "") { + refsrl_p = NULL; + } + } + if (refsrl_p != NULL) { + delete refsrl_p; + refsrl_p = NULL; + } } - if (inpsrl_p != NULL) { - delete inpsrl_p; - inpsrl_p = NULL; + + void writecache() { + phrasesim_p->writecache(); } - if (hypsrl_p != NULL) { - delete hypsrl_p; - hypsrl_p = NULL; - if (refsrl_name_m == ""){ - refsrl_p = NULL; - } + + void readcache() { + phrasesim_p->readcache(); } - if (refsrl_p != NULL) { - delete refsrl_p; - refsrl_p = NULL; + + void estimate_weight(std::vector srls) { + for (auto it = srls.begin(); it != srls.end(); it++) { + auto preds = it->get_preds(); + for (auto jt = preds.begin(); jt != preds.end(); jt++) { + auto pred_label = it->get_role_label(*jt); + if (label_m.find(pred_label) == label_m.end()) { + std::cerr << "ERROR: 
Unknown predicate label '" << pred_label + << "'. Check your labelconfig. Exiting..." << std::endl; + exit(1); + } + weight_m[label_m[pred_label]] += 0.25; + auto args = it->get_args(*jt); + for (auto kt = args.begin(); kt != args.end(); kt++) { + auto arg_label = it->get_role_label(*kt); + if (label_m.find(arg_label) == label_m.end()) { + std::cerr << "ERROR: Unknown argument label '" << arg_label + << "'. Check your labelconfig. Exiting..." << std::endl; + exit(1); + } + weight_m[label_m[arg_label]] += 1.0; + } + } + } } - } - - void writecache() { - phrasesim_p->writecache(); - } - - void readcache() { - phrasesim_p->readcache(); - } - - void estimate_weight(std::vector srls) { - for (auto it = srls.begin(); it != srls.end(); it++) { - auto preds = it->get_preds(); - for (auto jt = preds.begin(); jt != preds.end(); jt++) { - auto pred_label = it->get_role_label(*jt); - if (label_m.find(pred_label) == label_m.end()) { - std::cerr << "ERROR: Unknown predicate label '" << pred_label - << "'. Check your labelconfig. Exiting..." << std::endl; - exit(1); - } - weight_m[label_m[pred_label]] += 0.25; - auto args = it->get_args(*jt); - for (auto kt = args.begin(); kt != args.end(); kt++) { - auto arg_label = it->get_role_label(*kt); - if (label_m.find(arg_label) == label_m.end()) { - std::cerr << "ERROR: Unknown argument label '" << arg_label - << "'. Check your labelconfig. Exiting..." << std::endl; - exit(1); - } - weight_m[label_m[arg_label]] += 1.0; - } - } + + void estimate_weight(std::vector > msrls) { + for (auto it = msrls.begin(); it != msrls.end(); it++) { + estimate_weight(*it); + } } - } - - void estimate_weight(std::vector > msrls) { - for (auto it = msrls.begin(); it != msrls.end(); it++) { - estimate_weight(*it); + + std::vector inpsrlparse(std::vector inpsents) { + //std::cerr << "Tokenizing/SRL-ing the input ..."; + std::vector result = inpsrl_p->parse(inpsents); + //std::cerr << "Done." 
<< std::endl; + if (weightconfig_path_m == "") { + this->estimate_weight(result); + } + return result; } - } - - std::vector inpsrlparse(std::vector inpsents) { - //std::cerr << "Tokenizing/SRL-ing the input ..."; - std::vector result = inpsrl_p->parse(inpsents); - //std::cerr << "Done." << std::endl; - if (weightconfig_path_m == "") { - this->estimate_weight(result); + + std::vector refsrlparse(std::vector refsents) { + //std::cerr << "Tokenizing/SRL-ing the references ... "; + std::vector result = refsrl_p->parse(refsents); + //std::cerr << "Done." << std::endl; + if (weightconfig_path_m == "") { + this->estimate_weight(result); + } + return result; } - return result; - } - - std::vector refsrlparse(std::vector refsents) { - //std::cerr << "Tokenizing/SRL-ing the references ... "; - std::vector result = refsrl_p->parse(refsents); - //std::cerr << "Done." << std::endl; - if (weightconfig_path_m == "") { - this->estimate_weight(result); + + std::vector hypsrlparse(std::vector hypsents) { + //std::cerr << "Tokenizing/SRL-ing the hypotheses ... "; + std::vector result = hypsrl_p->parse(hypsents); + //std::cerr << "Done." << std::endl; + return result; } - return result; - } - - std::vector hypsrlparse(std::vector hypsents) { - //std::cerr << "Tokenizing/SRL-ing the hypotheses ... "; - std::vector result = hypsrl_p->parse(hypsents); - //std::cerr << "Done." << std::endl; - return result; - } - - srlgraph_t hypsrlparse(std::string hypsent) { - //std::cerr <<"Tokenizing/SRL-ing the hypothesis ... "; - srlgraph_t result = hypsrl_p->parse(hypsent); - //std::cerr << "Done." << std::endl; - return result; - } - - yisigraph_t align(const std::vector refsrlgraph, const srlgraph_t hypsrlgraph) { - //std::cerr << "Creating YiSi graph ... "; - yisigraph_t result(refsrlgraph, hypsrlgraph); - //std::cerr << "start aligning ... "; - result.align(phrasesim_p); - //result.print(std::cerr); - //std::cerr << "Done." 
<< std::endl; - return result; - } - - yisigraph_t align(const std::vector refsrlgraph, - const srlgraph_t hypsrlgraph, const srlgraph_t inpsrlgraph) { - //std::cerr << "Creating YiSi graph with input... "; - yisigraph_t result(refsrlgraph, hypsrlgraph, inpsrlgraph); - //std::cerr << "start aligning ... "; - result.align(phrasesim_p); - //result.print(std::cerr); - //std::cerr << "Done." << std::endl; - return result; - }; - - double score(yisigraph_t& yg) { - double precision = score(yg, yisi::HYP_MODE); - double recall = score(yg, yisi::REF_MODE); - double yisi = 0.0; - if (precision == 0.0 || recall == 0.0) { - yisi = 0.0; - } else { - yisi = (precision * recall) / (alpha_m * precision + (1.0 - alpha_m) * recall); + + srlgraph_t hypsrlparse(sent_t* hypsent) { + //std::cerr <<"Tokenizing/SRL-ing the hypothesis ... "; + srlgraph_t result = hypsrl_p->parse(hypsent); + //std::cerr << "Done." << std::endl; + return result; } - return yisi; - //double flat = yg.get_sentsim(); - //if (mode_m == "flat") { - // return flat; - //} else { - // //std::cerr<<"Computing YiSi precision ... "; - // double precision = score(yg, yisi::HYP_MODE); - // //std::cerr<<"Done."< features(yisigraph_t& yg) { - std::vector result; - //double flat = yg.get_sentsim(); - //result.push_back(flat); - //result.push_back(score(yg)); - std::vector precision = features(yg, yisi::HYP_MODE); - std::vector recall = features(yg, yisi::REF_MODE); - for (auto it = precision.begin(); it != precision.end(); it++) { - result.push_back(*it); + + yisigraph_t align(const std::vector refsrlgraph, const srlgraph_t hypsrlgraph) { + //std::cerr << "Creating YiSi graph ... "; + yisigraph_t result(refsrlgraph, hypsrlgraph); + //std::cerr << "start aligning ... "; + result.align(phrasesim_p); + //result.print(std::cerr); + //std::cerr << "Done." 
<< std::endl; + return result; } - for (auto it = recall.begin(); it != recall.end(); it++) { - result.push_back(*it); + + yisigraph_t align(const std::vector refsrlgraph, + const srlgraph_t hypsrlgraph, const srlgraph_t inpsrlgraph) { + //std::cerr << "Creating YiSi graph with input... "; + yisigraph_t result(refsrlgraph, hypsrlgraph, inpsrlgraph); + //std::cerr << "start aligning ... "; + result.align(phrasesim_p); + //result.print(std::cerr); + //std::cerr << "Done." << std::endl; + return result; + }; + + double score(yisigraph_t& yg) { + double precision = score(yg, yisi::HYP_MODE); + double recall = score(yg, yisi::REF_MODE); + double yisi = 0.0; + if (precision == 0.0 || recall == 0.0) { + yisi = 0.0; + } else { + yisi = (precision * recall) / (alpha_m * precision + (1.0 - alpha_m) * recall); + } + return yisi; + //double flat = yg.get_sentsim(); + //if (mode_m == "flat") { + // return flat; + //} else { + // //std::cerr<<"Computing YiSi precision ... "; + // double precision = score(yg, yisi::HYP_MODE); + // //std::cerr<<"Done."< 0) { - // // if (prfunc_name_m=="f" || prfunc_name_m=="max"){ - // double fw = yg.get_rolespanlength(predid, mode); - // double fn = 0.0; - // if (predsim >= rolesim_threshold_m) { - // fn = predweight * predsim; - // } - // double fd = predweight; - // auto args = yg.get_args(predid, mode); - // for (auto jt = args.begin(); jt != args.end(); jt++) { - // auto argid = *jt; - // fw += yg.get_rolespanlength(argid, mode); - - // auto arglabel = yg.get_rolelabel(argid, mode); - // auto alignlabel = yg.get_alignlabel(argid, mode); - // double argsim = yg.get_alignsim(argid, mode); - // double argweight = get_roleweight(yg, argid, mode); - // if (argsim >= rolesim_threshold_m - // && match(arglabel, alignlabel)) { - // fn += argweight * argsim; - // } - // fd += argweight; - // } - // if (fn > 0 && fd > 0) { - // if (frameweight_name_m == "coverage") { - // nom += fw * (fn / fd); - // } else { - // nom += fn / fd; - // } - // } - // if 
(frameweight_name_m == "coverage") { - // denom += fw; - // } else { - // denom += 1; - // } - // } else { - // if (predsim >= rolesim_threshold_m) { - // nom = predweight * predsim; - // } - // denom += predweight; - // auto args = yg.get_args(predid, mode); - // for (auto jt = args.begin(); jt != args.end(); jt++) { - // auto argid = *jt; - // auto arglabel = yg.get_rolelabel(argid, mode); - // auto alignlabel = yg.get_alignlabel(argid, mode); - // double argsim = yg.get_alignsim(argid, mode); - // double argweight = get_roleweight(yg, argid, mode); - // if (argsim >= rolesim_threshold_m - // && match(arglabel, alignlabel)) { - // nom += argweight * argsim; - // } - // denom += argweight; - // } - - // } - - //} - //} - //if (nom > 0 && denom > 0) { - // return nom/denom; - //} else { - // return 0.0; - //} - } - - std::vector features(yisigraph_t yg, int mode) { - if (mode == yisi::REF_MODE) { - return rfeatures(yg); - } else { - return pfeatures(yg); + + std::vector features(yisigraph_t& yg) { + std::vector result; + //double flat = yg.get_sentsim(); + //result.push_back(flat); + //result.push_back(score(yg)); + std::vector precision = features(yg, yisi::HYP_MODE); + std::vector recall = features(yg, yisi::REF_MODE); + for (auto it = precision.begin(); it != precision.end(); it++) { + result.push_back(*it); + } + for (auto it = recall.begin(); it != recall.end(); it++) { + result.push_back(*it); + } + return result; } - } - - void compute_features(yisigraph_t yg, std::vector feats, - double& structure, double& flat, int mode, int refid = -1) { - flat = yg.get_sentsim(mode, refid); - - double tfw = 0.0; // total frame weight - //std::vector tsim(weight_m.size(), 0.0); // total similarity by role type - //std::vector tcount(weight_m.size(), 0.0); // total count by role type - double nom = 0.0; - double denom = 0.0; - - auto preds = yg.get_preds(mode, refid); - - for (auto it = preds.begin(); it != preds.end(); it++) { - std::vector sim(weight_m.size(), 0.0); - 
std::vector count(weight_m.size(), 0.0); - auto predid = *it; - double sanity_check = yg.get_rolespanlength(predid, mode, refid); - double predsim = yg.get_alignsim(predid, mode, refid); - auto predlabel = yg.get_rolelabel(predid, mode, refid); - double predweight = get_roleweight(yg, predid, mode, refid); - - if (sanity_check > 0) { - //if (prfunc_name_m=="f" || prfunc_name_m=="max"){ - double fw = yg.get_rolespanlength(predid, mode, refid); - double fn = 0.0; - - sim[label_m[predlabel]] += predsim; - fn = predweight * predsim; - - double fd = predweight; - count[label_m[predlabel]] += 1.0; - - auto args = yg.get_args(predid, mode, refid); - for (auto jt = args.begin(); jt != args.end(); jt++) { - auto argid = *jt; - fw += yg.get_rolespanlength(argid, mode, refid); - - auto arglabel = yg.get_rolelabel(argid, mode, refid); - double argsim = 0.0; - yisigraph_t::label_type alignlabel; - if (mode == yisi::HYP_MODE) { - auto alignment = yg.get_hypalignment(argid); - for (auto it = alignment.begin(); it != alignment.end(); it++) { - double s = (it->second).second; - int id = it->first; - yisigraph_t::label_type l; - if (id < (int)yg.get_refsize()) { - l = yg.get_rolelabel((it->second).first, yisi::REF_MODE, id); - } else { - l = yg.get_rolelabel((it->second).first, yisi::INP_MODE); - } - if (s > argsim && match(arglabel, l)) { - argsim = s; - alignlabel = l; - } - } - } else { - alignlabel = yg.get_alignlabel(argid, mode, refid); - argsim = yg.get_alignsim(argid, mode, refid); - } - - double argweight = get_roleweight(yg, argid, mode, refid); - - sim[label_m[arglabel]] += argsim; - fn += argweight * argsim; - - count[label_m[arglabel]] += 1.0; - fd += argweight; - } - - if (fn > 0 && fd > 0) { - if (frameweight_name_m == "coverage") { - nom += fw * (fn / fd); - } else { - nom += fn / fd; - } - } - if (frameweight_name_m == "coverage") { - denom += fw; - } else { - denom += 1; - } - - for (size_t i = 0; i < feats.size(); i++) { - if (count[i] > 0) { - feats[i] += fw * 
(sim[i] / count[i]); - } - } - tfw += fw; - } + + private: + double score(yisigraph_t yg, int mode) { + //std::cerr <<"Scoring..."; + auto f = features(yg, mode); + double structure = f[weight_m.size()]; + double flat = f[weight_m.size() + 1]; + //std::cerr <<"(" << beta_m <<"," < 0) { + // // if (prfunc_name_m=="f" || prfunc_name_m=="max"){ + // double fw = yg.get_rolespanlength(predid, mode); + // double fn = 0.0; + // if (predsim >= rolesim_threshold_m) { + // fn = predweight * predsim; + // } + // double fd = predweight; + // auto args = yg.get_args(predid, mode); + // for (auto jt = args.begin(); jt != args.end(); jt++) { + // auto argid = *jt; + // fw += yg.get_rolespanlength(argid, mode); + + // auto arglabel = yg.get_rolelabel(argid, mode); + // auto alignlabel = yg.get_alignlabel(argid, mode); + // double argsim = yg.get_alignsim(argid, mode); + // double argweight = get_roleweight(yg, argid, mode); + // if (argsim >= rolesim_threshold_m + // && match(arglabel, alignlabel)) { + // fn += argweight * argsim; + // } + // fd += argweight; + // } + // if (fn > 0 && fd > 0) { + // if (frameweight_name_m == "coverage") { + // nom += fw * (fn / fd); + // } else { + // nom += fn / fd; + // } + // } + // if (frameweight_name_m == "coverage") { + // denom += fw; + // } else { + // denom += 1; + // } + // } else { + // if (predsim >= rolesim_threshold_m) { + // nom = predweight * predsim; + // } + // denom += predweight; + // auto args = yg.get_args(predid, mode); + // for (auto jt = args.begin(); jt != args.end(); jt++) { + // auto argid = *jt; + // auto arglabel = yg.get_rolelabel(argid, mode); + // auto alignlabel = yg.get_alignlabel(argid, mode); + // double argsim = yg.get_alignsim(argid, mode); + // double argweight = get_roleweight(yg, argid, mode); + // if (argsim >= rolesim_threshold_m + // && match(arglabel, alignlabel)) { + // nom += argweight * argsim; + // } + // denom += argweight; + // } + + // } + + //} + //} + //if (nom > 0 && denom > 0) { + // return 
nom/denom; + //} else { + // return 0.0; + //} } - if (tfw > 0) { - for (size_t i = 0; i < feats.size(); i++) { - feats[i] /= tfw; - } + + std::vector features(yisigraph_t yg, int mode) { + if (mode == yisi::REF_MODE) { + return rfeatures(yg); + } else { + return pfeatures(yg); + } } - - //if (prfunc_name_m == "lexexp") { - // for (size_t i = 0; i < tsim.size(); i++) { - // if (tcount[i] > 0) { - // result[i] = tsim[i] / tcount[i]; - // } - // } - //} - if (nom > 0 && denom > 0) { - structure = nom / denom; + + void compute_features(yisigraph_t yg, std::vector feats, + double& structure, double& flat, int mode, int refid = -1) { + flat = yg.get_sentsim(mode, refid); + + double tfw = 0.0; // total frame weight + //std::vector tsim(weight_m.size(), 0.0); // total similarity by role type + //std::vector tcount(weight_m.size(), 0.0); // total count by role type + double nom = 0.0; + double denom = 0.0; + + auto preds = yg.get_preds(mode, refid); + + for (auto it = preds.begin(); it != preds.end(); it++) { + std::vector sim(weight_m.size(), 0.0); + std::vector count(weight_m.size(), 0.0); + auto predid = *it; + double sanity_check = yg.get_rolespanlength(predid, mode, refid); + double predsim = yg.get_alignsim(predid, mode, refid); + auto predlabel = yg.get_rolelabel(predid, mode, refid); + double predweight = get_roleweight(yg, predid, mode, refid); + + if (sanity_check > 0) { + //if (prfunc_name_m=="f" || prfunc_name_m=="max"){ + double fw = yg.get_rolespanlength(predid, mode, refid); + double fn = 0.0; + + sim[label_m[predlabel]] += predsim; + fn = predweight * predsim; + + double fd = predweight; + count[label_m[predlabel]] += 1.0; + + auto args = yg.get_args(predid, mode, refid); + for (auto jt = args.begin(); jt != args.end(); jt++) { + auto argid = *jt; + fw += yg.get_rolespanlength(argid, mode, refid); + + auto arglabel = yg.get_rolelabel(argid, mode, refid); + double argsim = 0.0; + yisigraph_t::label_type alignlabel; + if (mode == yisi::HYP_MODE) { + auto 
alignment = yg.get_hypalignment(argid); + for (auto it = alignment.begin(); it != alignment.end(); it++) { + double s = (it->second).second; + int id = it->first; + yisigraph_t::label_type l; + if (id < (int)yg.get_refsize()) { + l = yg.get_rolelabel((it->second).first, yisi::REF_MODE, id); + } else { + l = yg.get_rolelabel((it->second).first, yisi::INP_MODE); + } + if (s > argsim && match(arglabel, l)) { + argsim = s; + alignlabel = l; + } + } + } else { + alignlabel = yg.get_alignlabel(argid, mode, refid); + argsim = yg.get_alignsim(argid, mode, refid); + } + + double argweight = get_roleweight(yg, argid, mode, refid); + + sim[label_m[arglabel]] += argsim; + fn += argweight * argsim; + + count[label_m[arglabel]] += 1.0; + fd += argweight; + } + + if (fn > 0 && fd > 0) { + if (frameweight_name_m == "coverage") { + nom += fw * (fn / fd); + } else { + nom += fn / fd; + } + } + if (frameweight_name_m == "coverage") { + denom += fw; + } else { + denom += 1; + } + + for (size_t i = 0; i < feats.size(); i++) { + if (count[i] > 0) { + feats[i] += fw * (sim[i] / count[i]); + } + } + tfw += fw; + } + } + if (tfw > 0) { + for (size_t i = 0; i < feats.size(); i++) { + feats[i] /= tfw; + } + } + + //if (prfunc_name_m == "lexexp") { + // for (size_t i = 0; i < tsim.size(); i++) { + // if (tcount[i] > 0) { + // result[i] = tsim[i] / tcount[i]; + // } + // } + //} + if (nom > 0 && denom > 0) { + structure = nom / denom; + } } - } - - std::vector pfeatures(yisigraph_t yg) { - std::vector result(weight_m.size(), 0.0); - double structure = 0.0; - double flat = 0.0; - - compute_features(yg, result, structure, flat, yisi::HYP_MODE); - - result.push_back(structure); - result.push_back(flat); - return result; - } - - std::vector rfeatures(yisigraph_t yg) { - std::vector result(weight_m.size(), 0.0); - double mflat = 0.0; - double mstructure = 0.0; - - //for all reference - for (size_t i = 0; i < yg.get_refsize(); i++) { - std::vector feats(weight_m.size(), 0.0); - double structure = 
0.0; - double flat = 0.0; - //std::cerr << "Computing recall features for reference #" << i << " ... "; - compute_features(yg, feats, structure, flat, yisi::REF_MODE, i); - //std::cerr << "Done." << std::endl; - if (structure > mstructure) { - mstructure = structure; - result = feats; - } - if (flat > mflat) { - mflat = flat; - } + + std::vector pfeatures(yisigraph_t yg) { + std::vector result(weight_m.size(), 0.0); + double structure = 0.0; + double flat = 0.0; + + compute_features(yg, result, structure, flat, yisi::HYP_MODE); + + result.push_back(structure); + result.push_back(flat); + return result; } - - //input - if (yg.withinp()) { - std::vector feats(weight_m.size(), 0.0); - double structure = 0.0; - double flat = 0.0; - //std::cerr << "Computing recall features for input ... "; - compute_features(yg, feats, structure, flat, yisi::INP_MODE); - //std::cerr << "Done." << std::endl; - if (structure > mstructure) { - mstructure = structure; - result = feats; - } - if (flat > mflat) { - mflat = flat; - } + + std::vector rfeatures(yisigraph_t yg) { + std::vector result(weight_m.size(), 0.0); + double mflat = 0.0; + double mstructure = 0.0; + + //for all reference + for (size_t i = 0; i < yg.get_refsize(); i++) { + std::vector feats(weight_m.size(), 0.0); + double structure = 0.0; + double flat = 0.0; + //std::cerr << "Computing recall features for reference #" << i << " ... "; + compute_features(yg, feats, structure, flat, yisi::REF_MODE, i); + //std::cerr << "Done." << std::endl; + if (structure > mstructure) { + mstructure = structure; + result = feats; + } + if (flat > mflat) { + mflat = flat; + } + } + + //input + if (yg.withinp()) { + std::vector feats(weight_m.size(), 0.0); + double structure = 0.0; + double flat = 0.0; + //std::cerr << "Computing recall features for input ... "; + compute_features(yg, feats, structure, flat, yisi::INP_MODE); + //std::cerr << "Done." 
<< std::endl; + if (structure > mstructure) { + mstructure = structure; + result = feats; + } + if (flat > mflat) { + mflat = flat; + } + } + + result.push_back(mstructure); + result.push_back(mflat); + return result; } - - result.push_back(mstructure); - result.push_back(mflat); - return result; - } - - bool match(std::string label1, std::string label2) { - if (label1 == "U" || label2 == "U") { - return false; - } else { - if (label_m.find(label1) == label_m.end()) { - std::cerr << "ERROR: Unknown srl label '" << label1 << "' in YiSi for matching label 1. " - << "Check your labelconfig. Exiting..." << std::endl; - exit(1); - } - if (label_m.find(label2) == label_m.end()) { - std::cerr << "ERROR: unknown srl label '" << label2 << "' in yisi for matching label 2. " - << "Check your labelconfig. Exiting..." << std::endl; - exit(1); - } - return (label_m[label1] == label_m[label2]); + + bool match(std::string label1, std::string label2) { + if (label1 == "U" || label2 == "U") { + return false; + } else { + if (label_m.find(label1) == label_m.end()) { + std::cerr << "ERROR: Unknown srl label '" << label1 << "' in YiSi for matching label 1. " + << "Check your labelconfig. Exiting..." << std::endl; + exit(1); + } + if (label_m.find(label2) == label_m.end()) { + std::cerr << "ERROR: unknown srl label '" << label2 << "' in yisi for matching label 2. " + << "Check your labelconfig. Exiting..." << std::endl; + exit(1); + } + return (label_m[label1] == label_m[label2]); + } } - } - - double get_roleweight(yisigraph_t yg, size_t roleid, int mode, int refid = -1) { - if (weightconfig_path_m == "lexweight") { - auto fillers = yg.get_role_fillers(roleid, mode, refid); - return phrasesim_p->get_lexweight(fillers, mode); - } else { - std::string label = yg.get_rolelabel(roleid, mode, refid); - if (label_m.find(label) == label_m.end()) { - std::cerr << "ERROR: Unknown srl label '" << label << "' in yisi for get_weight. " - << "Check your labelconfig. Exiting..." 
<< std::endl; - exit(1); - } - return weight_m[label_m[label]]; + + double get_roleweight(yisigraph_t yg, size_t roleid, int mode, int refid = -1) { + if (weightconfig_path_m == "lexweight") { + auto fillers = yg.get_role_filler_units(roleid, mode, refid); + return phrasesim_p->get_lexweight(fillers, mode); + } else { + std::string label = yg.get_rolelabel(roleid, mode, refid); + if (label_m.find(label) == label_m.end()) { + std::cerr << "ERROR: Unknown srl label '" << label << "' in yisi for get_weight. " + << "Check your labelconfig. Exiting..." << std::endl; + exit(1); + } + return weight_m[label_m[label]]; + } } - } - - phrasesim_t* phrasesim_p; - srl_t* inpsrl_p; - srl_t* refsrl_p; - srl_t* hypsrl_p; - - std::string hypsrl_name_m; - std::string refsrl_name_m; - std::string inpsrl_name_m; - std::string weightconfig_path_m; - //std::string predweight_name_m; - std::string frameweight_name_m; - //std::string prfunc_name_m; - - std::map label_m; - std::vector weight_m; - double alpha_m; - double beta_m; - }; // class yisiscorer_t + + phrasesim_t* phrasesim_p; + srl_t* inpsrl_p; + srl_t* refsrl_p; + srl_t* hypsrl_p; + + std::string hypsrl_name_m; + std::string refsrl_name_m; + std::string inpsrl_name_m; + std::string weightconfig_path_m; + //std::string predweight_name_m; + std::string frameweight_name_m; + //std::string prfunc_name_m; + + std::map label_m; + std::vector weight_m; + double alpha_m; + double beta_m; + }; // class yisiscorer_t } // yisi diff --git a/src/yisiscorer_test.cpp b/src/yisiscorer_test.cpp index 5fe033f..8f81b7c 100644 --- a/src/yisiscorer_test.cpp +++ b/src/yisiscorer_test.cpp @@ -37,8 +37,8 @@ int main(const int argc, const char* argv[]) string reffile("test_ref.en"); string hypfile("test_hyp.en"); - vector refsents = read_file(reffile); - vector hypsents = read_file(hypfile); + vector refsents = read_sent("word", reffile); + vector hypsents = read_sent("word", hypfile); auto r1 = yisi.refsrlparse(refsents); auto r2 = 
yisi.hypsrlparse(hypsents); @@ -51,4 +51,12 @@ int main(const int argc, const char* argv[]) cout << "YiSi score is:" << yisi.score(m) << endl; } + for (auto it = refsents.begin(); it != refsents.end(); it++) { + delete *it; + *it = NULL; + } + for (auto it = hypsents.begin(); it != hypsents.end(); it++) { + delete *it; + *it = NULL; + } } diff --git a/test/ref/srlgraph_test.out b/test/ref/srlgraph_test.out index 29f7afe..3961db3 100644 --- a/test/ref/srlgraph_test.out +++ b/test/ref/srlgraph_test.out @@ -23,7 +23,7 @@ One thing is certain : these new provisions will have a [AM-MNR negative] [TARGE One thing is certain : these new provisions will have a negative impact on [A1 voter] turn - [TARGET out] . [AM-ADV In this sense] , [A0 the measures] [AM-MOD will] [AM-MNR partially] [TARGET undermine] [A1 the American democratic system] . In this sense , the measures will partially undermine the American [A1 democratic] [TARGET system] . -Unlike in Canada , the American States are responsible for the organization of [A2 federal] [TARGET elections] [AM-LOC in the United States] . +Unlike in Canada , the American States are responsible for the organisation of [A2 federal] [TARGET elections] [AM-LOC in the United States] . It is in this spirit that a [TARGET majority] [A1 of American governments] have passed new laws since 2009 making the registration or voting process more difficult . It is in this spirit that a majority of [A2 American] [TARGET governments] have passed new laws since 2009 making the registration or voting process more difficult . It is in this spirit that [A0 a majority of American governments] have [TARGET passed] [A1 new laws] [AM-TMP since 2009] making the registration or voting process more difficult . 
diff --git a/test/ref/srlutil_test.out b/test/ref/srlutil_test.out index e3b4bd9..11d8b96 100644 --- a/test/ref/srlutil_test.out +++ b/test/ref/srlutil_test.out @@ -1,7 +1,7 @@ A [A0 Republican] [V strategy] [A1 to counter the re - election of Obama] A Republican strategy to [V counter] [A1 the re - election of Obama] A Republican strategy to counter the re - [V election] [A1 of Obama] -[A0 [A2 Republican] [V leaders]] justified their policy by the need to combat electoral fraud . +[A2 Republican] [V leaders] justified their policy by the need to combat electoral fraud . [A0 Republican leaders] [V justified] [A1 their policy] [A2 by the need to combat electoral fraud] . Republican leaders justified [A0 their] [V policy] by the need to combat electoral fraud . Republican leaders justified their policy by the [V need] [A1 to combat electoral fraud] . @@ -12,32 +12,32 @@ However , [A0 the Brennan Centre] considers this a myth , [V stating] [A1 that e However , the Brennan Centre considers this a myth , stating that [A1 electoral] [V fraud] is rarer in the United States than the number of people killed by lightning . However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the [V number] [A1 of people killed by lightning] . However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the number of [A1 people] [V killed] [A0 by lightning] . -Indeed , [A0 [A2 Republican] [V lawyers]] identified only 300 cases of electoral fraud in the United States in a decade . +Indeed , [A2 Republican] [V lawyers] identified only 300 cases of electoral fraud in the United States in a decade . [AM-DIS Indeed] , [A0 Republican lawyers] [V identified] [A1 only 300 cases of electoral fraud in the United States] [AM-TMP in a decade] . Indeed , Republican lawyers identified only 300 [V cases] [A1 of electoral fraud] in the United States in a decade . 
Indeed , Republican lawyers identified only 300 cases of [A1 electoral] [V fraud] in the United States in a decade . One thing is certain : [A0 these new provisions] [AM-MOD will] [V have] [A1 a negative impact on voter turn - out] . One thing is certain : these new provisions will have a [AM-MNR negative] [V impact] [A1 on voter turn - out] . One thing is certain : these new provisions will have a negative impact on [A1 voter] turn - [V out] . -[AM-ADV In this sense] , [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] . +[AM-ADV In this sense] [AM-MOD ,] [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] [AM-MOD .] In this sense , the measures will partially undermine the American [A1 democratic] [V system] . Unlike in Canada , the American States are responsible for the organization of [A2 federal] [V elections] [AM-LOC in the United States] . It is in this spirit that a [V majority] [A1 of American governments] have passed new laws since 2009 making the registration or voting process more difficult . -It is in this spirit that a majority of [A0 [A2 American] [V governments]] have passed new laws since 2009 making the registration or voting process more difficult . +It is in this spirit that a majority of [A2 American] [V governments] have passed new laws since 2009 making the registration or voting process more difficult . It is in this spirit that [A0 a majority of American governments] have [V passed] [A1 new laws] [AM-TMP since 2009] making the registration or voting process more difficult . -It is in this spirit that [A0 a majority of American governments] have passed [A1 new [V laws]] since 2009 making the registration or voting process more difficult . +It is in this spirit that [A0 a majority of American governments] have passed new [V laws] since 2009 making the registration or voting process more difficult . 
It is in this spirit that [A0 a majority of American governments] have passed new laws since 2009 [V making] [A1 the registration or voting process] [A2 more difficult] . It is in this spirit that a majority of American governments have passed new laws since 2009 making the [V registration] or voting process more difficult . It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [V voting] process more difficult . It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [A1 voting] [V process] more difficult . [A1 This phenomenon] [V gained] [A2 momentum] [AM-TMP following the November 2010 elections , which saw 675 new Republican representatives added in 26 States] . -[A1 This phenomenon] gained [A2 [V momentum]] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States . +[A1 This phenomenon] gained [V momentum] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States . This phenomenon gained momentum [V following] [A2 the November 2010 elections , which saw 675 new Republican representatives added in 26 States] . -This phenomenon gained momentum following the November 2010 [A0 elections] , [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] . -This phenomenon gained momentum following the November 2010 elections , which saw [A0 675 new [A4 Republican] [V representatives]] added in 26 States . +This phenomenon gained momentum following [A0 the November 2010 elections ,] [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] . +This phenomenon gained momentum following the November 2010 elections , which saw 675 new [A4 Republican] [V representatives] added in 26 States . 
This phenomenon gained momentum following the November 2010 elections , which saw [A1 675 new Republican representatives] [V added] [AM-LOC in 26 States] . -[A2 As a [V result] , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .] -As a result , 180 [A0 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone . +[A2 As] a [V result] [A2 , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .] +As a result , [A0 180 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone . As a result , 180 bills restricting the [V exercise] [A1 of the right to vote in 41 States] were introduced in 2011 alone . As a result , 180 bills restricting the exercise of the [V right] [A1 to vote in 41 States] were introduced in 2011 alone . As a result , 180 bills restricting the exercise of the right to [V vote] [AM-LOC in 41 States] were introduced in 2011 alone . 
diff --git a/test/ref/test_hyp.docyisi0 b/test/ref/test_hyp.docyisi0 index 86b415f..9f47e2b 100644 --- a/test/ref/test_hyp.docyisi0 +++ b/test/ref/test_hyp.docyisi0 @@ -1 +1 @@ -0.645506 +0.693223 diff --git a/test/ref/test_hyp.docyisi1_srl b/test/ref/test_hyp.docyisi1_srl index fc46d40..cd3cc4c 100644 --- a/test/ref/test_hyp.docyisi1_srl +++ b/test/ref/test_hyp.docyisi1_srl @@ -1 +1 @@ -0.639611 +0.637235 diff --git a/test/ref/test_hyp.docyisi1_srl.alt b/test/ref/test_hyp.docyisi1_srl.alt index a9768df..e39b5b3 100644 --- a/test/ref/test_hyp.docyisi1_srl.alt +++ b/test/ref/test_hyp.docyisi1_srl.alt @@ -1 +1 @@ -0.639393 +0.636885 diff --git a/test/ref/test_hyp.docyisi2_srl b/test/ref/test_hyp.docyisi2_srl index 1ffef82..1de9edb 100644 --- a/test/ref/test_hyp.docyisi2_srl +++ b/test/ref/test_hyp.docyisi2_srl @@ -1 +1 @@ -0.0652749 +0.0660683 diff --git a/test/ref/test_hyp.docyisi2_srl.alt b/test/ref/test_hyp.docyisi2_srl.alt index 462088b..55a7de9 100644 --- a/test/ref/test_hyp.docyisi2_srl.alt +++ b/test/ref/test_hyp.docyisi2_srl.alt @@ -1 +1 @@ -0.0641709 +0.0672199 diff --git a/test/ref/test_hyp.sntyisi0 b/test/ref/test_hyp.sntyisi0 index b42e130..df0cbe0 100644 --- a/test/ref/test_hyp.sntyisi0 +++ b/test/ref/test_hyp.sntyisi0 @@ -1,10 +1,10 @@ -0.738498 -0.719384 -0.689899 -0.643572 -0.499597 -0.627008 -0.596041 -0.554946 -0.583918 -0.802202 +0.894586 +0.733148 +0.753002 +0.655633 +0.57693 +0.672231 +0.614407 +0.58164 +0.595404 +0.855247 diff --git a/test/ref/test_hyp.sntyisi1_srl b/test/ref/test_hyp.sntyisi1_srl index 96cda7d..af29c7f 100644 --- a/test/ref/test_hyp.sntyisi1_srl +++ b/test/ref/test_hyp.sntyisi1_srl @@ -1,10 +1,10 @@ -0.859564 -0.691584 -0.645726 -0.632753 -0.459998 -0.592549 -0.556889 -0.546071 -0.546333 -0.864648 +0.858821 +0.695714 +0.645749 +0.633018 +0.458332 +0.577509 +0.557156 +0.534926 +0.543717 +0.86741 diff --git a/test/ref/test_hyp.sntyisi1_srl.alt b/test/ref/test_hyp.sntyisi1_srl.alt index f9eeee9..f2409b4 100644 --- 
a/test/ref/test_hyp.sntyisi1_srl.alt +++ b/test/ref/test_hyp.sntyisi1_srl.alt @@ -1,10 +1,10 @@ -0.859824 -0.691795 -0.645973 -0.633111 -0.455921 -0.59255 -0.557174 -0.54644 -0.546505 -0.864636 +0.858821 +0.695714 +0.645749 +0.633018 +0.454832 +0.577509 +0.557156 +0.534926 +0.543717 +0.86741 diff --git a/test/ref/test_hyp.sntyisi2_srl b/test/ref/test_hyp.sntyisi2_srl index ecaa858..121144b 100644 --- a/test/ref/test_hyp.sntyisi2_srl +++ b/test/ref/test_hyp.sntyisi2_srl @@ -1,10 +1,10 @@ -0.0464296 -0.0116361 -0.0696774 +0.0352922 +0.0116406 +0.07006 0.0665215 -0.0274319 -0.0927175 -0.00336682 +0.0273455 +0.0937853 +0.0033759 0.0519643 -0.141262 -0.141742 +0.141268 +0.159431 diff --git a/test/ref/test_hyp.sntyisi2_srl.alt b/test/ref/test_hyp.sntyisi2_srl.alt index ccab905..c2b6a05 100644 --- a/test/ref/test_hyp.sntyisi2_srl.alt +++ b/test/ref/test_hyp.sntyisi2_srl.alt @@ -1,10 +1,10 @@ -0.0354018 -0.0116361 -0.0696774 +0.0468075 +0.0116406 +0.07006 0.0665215 -0.0274319 -0.0927175 -0.00336682 +0.0273455 +0.0937853 +0.0033759 0.0519643 -0.141262 -0.14173 +0.141268 +0.159431 diff --git a/test/ref/test_ref.en.srl b/test/ref/test_ref.en.srl index cf5fe83..fb81dd8 100644 --- a/test/ref/test_ref.en.srl +++ b/test/ref/test_ref.en.srl @@ -1,7 +1,7 @@ 0: A [A0 Republican] [V strategy] [A1 to counter the re - election of Obama] 0: A Republican strategy to [V counter] [A1 the re - election of Obama] 0: A Republican strategy to counter the re - [V election] [A1 of Obama] -1: [A0 [A2 Republican] [V leaders]] justified their policy by the need to combat electoral fraud . +1: [A2 Republican] [V leaders] justified their policy by the need to combat electoral fraud . 1: [A0 Republican leaders] [V justified] [A1 their policy] [A2 by the need to combat electoral fraud] . 1: Republican leaders justified [A0 their] [V policy] by the need to combat electoral fraud . 1: Republican leaders justified their policy by the [V need] [A1 to combat electoral fraud] . 
@@ -12,33 +12,33 @@ 2: However , the Brennan Centre considers this a myth , stating that [A1 electoral] [V fraud] is rarer in the United States than the number of people killed by lightning . 2: However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the [V number] [A1 of people killed by lightning] . 2: However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the number of [A1 people] [V killed] [A0 by lightning] . -3: Indeed , [A0 [A2 Republican] [V lawyers]] identified only 300 cases of electoral fraud in the United States in a decade . +3: Indeed , [A2 Republican] [V lawyers] identified only 300 cases of electoral fraud in the United States in a decade . 3: [AM-DIS Indeed] , [A0 Republican lawyers] [V identified] [A1 only 300 cases of electoral fraud in the United States] [AM-TMP in a decade] . 3: Indeed , Republican lawyers identified only 300 [V cases] [A1 of electoral fraud] in the United States in a decade . 3: Indeed , Republican lawyers identified only 300 cases of [A1 electoral] [V fraud] in the United States in a decade . 4: One thing is certain : [A0 these new provisions] [AM-MOD will] [V have] [A1 a negative impact on voter turn - out] . 4: One thing is certain : these new provisions will have a [AM-MNR negative] [V impact] [A1 on voter turn - out] . 4: One thing is certain : these new provisions will have a negative impact on [A1 voter] turn - [V out] . -5: [AM-ADV In this sense] , [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] . +5: [AM-ADV In this sense] [AM-MOD ,] [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] [AM-MOD .] 5: In this sense , the measures will partially undermine the American [A1 democratic] [V system] . 
 6: Unlike in Canada , the American States are responsible for the [V organisation] [A1 of federal elections in the United States] .
 6: Unlike in Canada , the American States are responsible for the organisation of [A2 federal] [V elections] [AM-LOC in the United States] .
 7: It is in this spirit that a [V majority] [A1 of American governments] have passed new laws since 2009 making the registration or voting process more difficult .
-7: It is in this spirit that a majority of [A0 [A2 American] [V governments]] have passed new laws since 2009 making the registration or voting process more difficult .
+7: It is in this spirit that a majority of [A2 American] [V governments] have passed new laws since 2009 making the registration or voting process more difficult .
 7: It is in this spirit that [A0 a majority of American governments] have [V passed] [A1 new laws] [AM-TMP since 2009] making the registration or voting process more difficult .
-7: It is in this spirit that [A0 a majority of American governments] have passed [A1 new [V laws]] since 2009 making the registration or voting process more difficult .
+7: It is in this spirit that [A0 a majority of American governments] have passed new [V laws] since 2009 making the registration or voting process more difficult .
 7: It is in this spirit that [A0 a majority of American governments] have passed new laws since 2009 [V making] [A1 the registration or voting process] [A2 more difficult] .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the [V registration] or voting process more difficult .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [V voting] process more difficult .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [A1 voting] [V process] more difficult .
 8: [A1 This phenomenon] [V gained] [A2 momentum] [AM-TMP following the November 2010 elections , which saw 675 new Republican representatives added in 26 States] .
-8: [A1 This phenomenon] gained [A2 [V momentum]] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States .
+8: [A1 This phenomenon] gained [V momentum] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States .
 8: This phenomenon gained momentum [V following] [A2 the November 2010 elections , which saw 675 new Republican representatives added in 26 States] .
-8: This phenomenon gained momentum following the November 2010 [A0 elections] , [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] .
-8: This phenomenon gained momentum following the November 2010 elections , which saw [A0 675 new [A4 Republican] [V representatives]] added in 26 States .
+8: This phenomenon gained momentum following [A0 the November 2010 elections ,] [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] .
+8: This phenomenon gained momentum following the November 2010 elections , which saw 675 new [A4 Republican] [V representatives] added in 26 States .
 8: This phenomenon gained momentum following the November 2010 elections , which saw [A1 675 new Republican representatives] [V added] [AM-LOC in 26 States] .
-9: [A2 As a [V result] , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .]
-9: As a result , 180 [A0 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone .
+9: [A2 As] a [V result] [A2 , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .]
+9: As a result , [A0 180 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the [V exercise] [A1 of the right to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the exercise of the [V right] [A1 to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the exercise of the right to [V vote] [AM-LOC in 41 States] were introduced in 2011 alone .
diff --git a/test/ref/test_ref.en.srl.alt b/test/ref/test_ref.en.srl.alt
index 445d76e..d7ad715 100644
--- a/test/ref/test_ref.en.srl.alt
+++ b/test/ref/test_ref.en.srl.alt
@@ -1,7 +1,7 @@
 0: A [A0 Republican] [V strategy] [A1 to counter the re - election of Obama]
 0: A Republican strategy to [V counter] [A1 the re - election of Obama]
 0: A Republican strategy to counter the re - [V election] [A1 of Obama]
-1: [A0 [A2 Republican] [V leaders]] justified their policy by the need to combat electoral fraud .
+1: [A2 Republican] [V leaders] justified their policy by the need to combat electoral fraud .
 1: [A0 Republican leaders] [V justified] [A1 their policy] [A2 by the need to combat electoral fraud] .
 1: Republican leaders justified [A0 their] [V policy] by the need to combat electoral fraud .
 1: Republican leaders justified their policy by the [V need] [A1 to combat electoral fraud] .
@@ -12,33 +12,33 @@
 2: However , the Brennan Centre considers this a myth , stating that [A1 electoral] [V fraud] is rarer in the United States than the number of people killed by lightning .
 2: However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the [V number] [A1 of people killed by lightning] .
 2: However , the Brennan Centre considers this a myth , stating that electoral fraud is rarer in the United States than the number of [A1 people] [V killed] [A0 by lightning] .
-3: Indeed , [A0 [A2 Republican] [V lawyers]] identified only 300 cases of electoral fraud in the United States in a decade .
+3: Indeed , [A2 Republican] [V lawyers] identified only 300 cases of electoral fraud in the United States in a decade .
 3: [AM-DIS Indeed] , [A0 Republican lawyers] [V identified] [A1 only 300 cases of electoral fraud in the United States] [AM-TMP in a decade] .
 3: Indeed , Republican lawyers identified only 300 [V cases] [A1 of electoral fraud] in the United States in a decade .
 3: Indeed , Republican lawyers identified only 300 cases of [A1 electoral] [V fraud] in the United States in a decade .
 4: One thing is certain : [A0 these new provisions] [AM-MOD will] [V have] [A1 a negative impact on voter turn - out] .
 4: One thing is certain : these new provisions will have a [AM-MNR negative] [V impact] [A1 on voter turn - out] .
-4: One thing is certain : these new provisions will have a negative impact on [A1 voter] [A1 turn -] [V out] .
-5: [AM-ADV In this sense] , [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] .
+4: One thing is certain : these new provisions will have a negative impact on [A1 voter turn -] [V out] .
+5: [AM-ADV In this sense] [AM-MOD ,] [A0 the measures] [AM-MOD will] [AM-MNR partially] [V undermine] [A1 the American democratic system] [AM-MOD .]
 5: In this sense , the measures will partially undermine the American [A1 democratic] [V system] .
 6: Unlike in Canada , the American States are responsible for the [V organisation] [A1 of federal elections in the United States] .
 6: Unlike in Canada , the American States are responsible for the organisation of [A2 federal] [V elections] [AM-LOC in the United States] .
 7: It is in this spirit that a [V majority] [A1 of American governments] have passed new laws since 2009 making the registration or voting process more difficult .
-7: It is in this spirit that a majority of [A0 [A2 American] [V governments]] have passed new laws since 2009 making the registration or voting process more difficult .
+7: It is in this spirit that a majority of [A2 American] [V governments] have passed new laws since 2009 making the registration or voting process more difficult .
 7: It is in this spirit that [A0 a majority of American governments] have [V passed] [A1 new laws] [AM-TMP since 2009] making the registration or voting process more difficult .
-7: It is in this spirit that [A0 a majority of American governments] have passed [A1 new [V laws]] since 2009 making the registration or voting process more difficult .
+7: It is in this spirit that [A0 a majority of American governments] have passed new [V laws] since 2009 making the registration or voting process more difficult .
 7: It is in this spirit that [A0 a majority of American governments] have passed new laws since 2009 [V making] [A1 the registration or voting process] [A2 more difficult] .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the [V registration] or voting process more difficult .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [V voting] process more difficult .
 7: It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or [A1 voting] [V process] more difficult .
 8: [A1 This phenomenon] [V gained] [A2 momentum] [AM-TMP following the November 2010 elections , which saw 675 new Republican representatives added in 26 States] .
-8: [A1 This phenomenon] gained [A2 [V momentum]] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States .
+8: [A1 This phenomenon] gained [V momentum] following the November 2010 elections , which saw 675 new Republican representatives added in 26 States .
 8: This phenomenon gained momentum [V following] [A2 the November 2010 elections , which saw 675 new Republican representatives added in 26 States] .
-8: This phenomenon gained momentum following the November 2010 [A0 elections] , [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] .
-8: This phenomenon gained momentum following the November 2010 elections , which saw [A0 675 new [A4 Republican] [V representatives]] added in 26 States .
+8: This phenomenon gained momentum following [A0 the November 2010 elections ,] [R-A0 which] [V saw] [A1 675 new Republican representatives] [C-A1 added in 26 States] .
+8: This phenomenon gained momentum following the November 2010 elections , which saw 675 new [A4 Republican] [V representatives] added in 26 States .
 8: This phenomenon gained momentum following the November 2010 elections , which saw [A1 675 new Republican representatives] [V added] [AM-LOC in 26 States] .
-9: [A2 As a [V result] , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .]
-9: As a result , 180 [A0 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone .
+9: [A2 As] a [V result] [A2 , 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone .]
+9: As a result , [A0 180 bills] [V restricting] [A1 the exercise of the right to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the [V exercise] [A1 of the right to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the exercise of the [V right] [A1 to vote in 41 States] were introduced in 2011 alone .
 9: As a result , 180 bills restricting the exercise of the right to [V vote] [AM-LOC in 41 States] were introduced in 2011 alone .
diff --git a/test/ref/test_yisi_0.out b/test/ref/test_yisi_0.out
index f64674d..c3a8c65 100644
--- a/test/ref/test_yisi_0.out
+++ b/test/ref/test_yisi_0.out
@@ -1,7 +1,9 @@
 Constructing lcs lexsim model
 Learning lex weight from test_ref.en ... Done.
-Tokenizing/SRL-ing hyp ... Done.
-Tokenizing/SRL-ing ref ... Done.
+Reading hyp sents... Done.
+Reading ref sents... Done.
+Creating hyp srlgraphs... Done.
+Creating ref srlgraphs... Done.
 Evaluating line 1
 Evaluating line 2
 Evaluating line 3
diff --git a/test/ref/test_yisi_1.out b/test/ref/test_yisi_1.out
index 754096a..65e8327 100644
--- a/test/ref/test_yisi_1.out
+++ b/test/ref/test_yisi_1.out
@@ -2,8 +2,10 @@ Reading w2v text model from mini.d300.en
 Size of voc: 500 Dimension: 300
 Finished reading w2v model.
 Learning lex weight from test_ref.en ... Done.
-Tokenizing/SRL-ing hyp ... Done.
-Tokenizing/SRL-ing ref ... Done.
+Reading hyp sents... Done.
+Reading ref sents... Done.
+Creating hyp srlgraphs... Done.
+Creating ref srlgraphs... Done.
 Evaluating line 1
 Evaluating line 2
 Evaluating line 3
diff --git a/test/ref/test_yisi_1_srl.out b/test/ref/test_yisi_1_srl.out
index 90d4f47..796deca 100644
--- a/test/ref/test_yisi_1_srl.out
+++ b/test/ref/test_yisi_1_srl.out
@@ -29,8 +29,10 @@ Cluster null
 Loading pipeline from /home/das011/u/sandboxes/mateplus/models/srl-EMNLP14+fs-eng.model
 Loading reranker from /home/das011/u/sandboxes/mateplus/models/srl-EMNLP14+fs-eng.model
 Done.
-Tokenizing/SRL-ing hyp ... Done.
-Tokenizing/SRL-ing ref ... Done.
+Reading hyp sents... Done.
+Reading ref sents... Done.
+Creating hyp srlgraphs... Done.
+Creating ref srlgraphs... Done.
 Evaluating line 1
 Evaluating line 2
 Evaluating line 3
diff --git a/test/ref/test_yisi_2.out b/test/ref/test_yisi_2.out
index e56b9ca..c648453 100644
--- a/test/ref/test_yisi_2.out
+++ b/test/ref/test_yisi_2.out
@@ -6,8 +6,10 @@ Size of voc: 500 Dimension: 300
 Finished reading w2v model.
 Learning lex weight from test_hyp.en ... Done.
 Learning lex weight from test_inp.de ... Done.
-Tokenizing/SRL-ing hyp ... Done.
-Tokenizing/SRL-ing inp ... Done.
+Reading hyp sents... Done.
+Reading inp sents... Done.
+Creating hyp srlgraphs... Done.
+Creating inp srlgraphs... Done.
 Evaluating line 1
 Evaluating line 2
 Evaluating line 3
diff --git a/test/ref/test_yisi_2_srl.out b/test/ref/test_yisi_2_srl.out
index c45f324..6913024 100644
--- a/test/ref/test_yisi_2_srl.out
+++ b/test/ref/test_yisi_2_srl.out
@@ -57,8 +57,10 @@ Cluster null
 Loading pipeline from /home/das011/u/sandboxes/mateplus/models/srl-EMNLP14+fs-ger.model
 Loading reranker from /home/das011/u/sandboxes/mateplus/models/srl-EMNLP14+fs-ger.model
 Done.
-Tokenizing/SRL-ing hyp ... Done.
-Tokenizing/SRL-ing inp ... Done.
+Reading hyp sents... Done.
+Reading inp sents... Done.
+Creating hyp srlgraphs... Done.
+Creating inp srlgraphs... Done.
 Evaluating line 1
 Evaluating line 2
 Evaluating line 3