
Commit 530e5e2

some more refactor + readded many options + better hyperparameter defaults
mdelhoneux committed Aug 2, 2018
1 parent acbd1a3 commit 530e5e2
Showing 18 changed files with 1,005 additions and 761 deletions.
69 changes: 43 additions & 26 deletions README.md
# UUParser: A transition-based dependency parser for Universal Dependencies

This parser is based on [Eli Kiperwasser's transition-based parser](http://github.com/elikip/bist-parser) using BiLSTM feature extractors.
We adapted the parser to Universal Dependencies and extended it as described in these papers:

* (Version 1.0) Adaptation to UD + removed POS tags from the input + added character vectors + use of pseudo-projective parsing:
>Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From Raw Text to Universal Dependencies - Look, No Tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

* (Version 2.0) Removed the need for pseudo-projective parsing by using a swap transition and creating a partially dynamic oracle as described in:
>Miryam de Lhoneux, Sara Stymne and Joakim Nivre. 2017. Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle. In Proceedings of the The 15th International Conference on Parsing Technologies (IWPT).

* (Version 2.3) Added POS tags back in, extended cross-treebank functionality and the use of external embeddings, and tuned default hyperparameters:

>Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao and Sara Stymne. 2018. 82 Treebanks, 34 Models: Universal Dependency Parsing with Cross-Treebank Models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
The techniques behind the original parser are described in the paper [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://www.transacl.org/ojs/index.php/tacl/article/viewFile/885/198).

#### Required software

* Python 2.7 interpreter
* [DyNet library](https://github.com/clab/dynet/tree/master/python)

Note: the current version uses DyNet 2.0, but DyNet 1.0 was used in both releases 1.0 and 2.0.
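
If a prebuilt wheel is available for your platform, DyNet can usually be installed with pip; otherwise see the DyNet repository linked above for build instructions:

```
pip install dynet
```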


#### Train a parsing model

To train parsing models for a set of treebanks:

    python src/parser.py --outdir [results directory] --datadir [directory of UD files with the structure UD\_\*\*/iso\_id-ud-train/dev.conllu] --include [treebanks to include denoted by their ISO id]

#### Options

The parser has numerous options that allow you to fine-tune its behaviour. For a full list, type:

    python src/parser.py --help

We recommend you set the --dynet-mem option to a larger number when running the full training procedure on larger treebanks.
Commonly used values are 5000 and 10000 (in MB).

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.).
To ensure identical results between two runs, we recommend setting the --dynet-seed option to the same value both times (e.g. --dynet-seed 123456789).
This ensures that Python's random number generator and DyNet both produce the same sequence of random numbers.
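
As a quick sketch (treebank and directory names here are illustrative), two runs like the following should therefore produce identical models and dev scores:

```
python src/parser.py --outdir run1 --datadir ud-treebanks-v2.0 --include "sv" --dynet-seed 123456789 --dynet-mem 5000
python src/parser.py --outdir run2 --datadir ud-treebanks-v2.0 --include "sv" --dynet-seed 123456789 --dynet-mem 5000
```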

#### Example

The following is a typical command for training separate models for UD_Swedish, UD_Russian, and UD_English:

    python src/parser.py --outdir my_output --datadir ud-treebanks-v2.0 --include "sv ru en" --dynet-seed 123456789 --dynet-mem 10000

The output files will be created in my_output/sv, my_output/ru, and my_output/en.
This command assumes that the directory UD_Swedish exists in ud-treebanks-v2.0 and contains at least the file sv-ud-train.conllu (and the same for the other two languages).
If dev data is also found (sv-ud-dev.conllu), model selection will be performed by default by parsing the dev data at each epoch and choosing the model from the epoch with the highest LAS.

#### Pick a model

Training saves one model per epoch and evaluates it on the dev set.
For prediction, the parser expects a model directory with one model and a parameter file for each treebank in a subdirectory, e.g. models/en/barchybrid.model and models/en/params.pickle.
Before using your models for prediction, you probably want to pick one model per treebank. You can do this manually by looking at performance on the dev sets and copying the model and parameter files.
Alternatively, you can use a script to do it for many treebanks at a time. The script expects a file with one ISO code per line and a directory of trained models with evaluations on the dev sets. It creates a directory called 'models' and copies the best-performing model on each dev set as measured by LAS. If models are found but no evaluation (some treebanks do not have a dev set), it picks the last epoch trained.

```
python scripts/pick_model.py iso_codes.txt dir_with_models
```
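
For reference, here is a sketch of the --datadir layout assumed by the training example above (treebank and file names are illustrative and follow UD v2.0 naming conventions):

```
ud-treebanks-v2.0/
├── UD_Swedish/
│   ├── sv-ud-train.conllu
│   └── sv-ud-dev.conllu
├── UD_Russian/
│   ├── ru-ud-train.conllu
│   └── ru-ud-dev.conllu
└── UD_English/
    ├── en-ud-train.conllu
    └── en-ud-dev.conllu
```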

#### Parse data with your parsing model

    python src/parser.py --predict --outdir [results directory] --datadir [directory of UD files with the structure UD\_\*\*/iso\_id-ud-train/dev.conllu] --include [treebanks to include denoted by their ISO id]

By default this will parse the dev data for the specified treebanks with the model files (by default barchybrid.model) found in treebank-specific subdirectories of outdir, and will store the resulting .conllu files in the output directory (--outdir).
Note that if you don't want to use the same directory for model files and output files, you can specify --modeldir explicitly.
By default, it is assumed that --modeldir is the same as --outdir.
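
For example, to parse the Swedish dev data with the model trained in the example above (directory names are again illustrative):

```
python src/parser.py --predict --outdir my_output --datadir ud-treebanks-v2.0 --include "sv"
```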

#### Multi-treebank models

An important feature of the parser is the ability to train cross-treebank models by adding a treebank embedding.
This technique is described in detail in:

>Sara Stymne, Miryam de Lhoneux, Aaron Smith and Joakim Nivre. 2018. Parser Training with Heterogeneous Treebanks. In Proceedings of ACL.

To train a multi-treebank model, simply add the --multiling flag at both training and test time.
The output model files will be stored by default directly in the specified output directory rather than in treebank-specific subdirectories.
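
A minimal sketch (treebank IDs and the output directory are illustrative): train a single model on three treebanks, then use it for prediction:

```
python src/parser.py --outdir multi_output --datadir ud-treebanks-v2.0 --include "sv ru en" --multiling
python src/parser.py --predict --outdir multi_output --datadir ud-treebanks-v2.0 --include "sv ru en" --multiling
```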

#### Citation

If you make use of this software for research purposes, we would appreciate it if you cite the following:

If you use version 2.0 or later:

@InProceedings{delhoneux17arc,
author = {Miryam de Lhoneux and Sara Stymne and Joakim Nivre},
64 changes: 64 additions & 0 deletions barchybrid/scripts/analysis_multimono.py
from __future__ import print_function
import numpy as np
import sys
from optparse import OptionParser

# Example usage: python analysis_multimono.py baselines_v2 3 byte_model 1 ar en fi
# Finds the average improvement on the best dev epochs of the second experiment
# over the first for the listed languages. The number after each experiment name
# is the number of repeats of that experiment: in this case the script expects
# to also find baselines_v2-2 and baselines_v2-3, and it uses all runs to
# calculate averages.

def main(options, args):

    np.set_printoptions(precision=2)

    # If the experiment/baseline name is e.g. baselines_v3, multiple runs need
    # to be stored as baselines_v3-2, baselines_v3-3, etc.
    baseline = args[0]       # name of the baseline experiment
    bas_runs = int(args[1])  # number of baseline runs
    exp_name = args[2]       # name of the new experiment
    exp_runs = int(args[3])  # number of new-experiment runs
    print("Results for experiment: " + exp_name)

    langs = args[4:]
    bas_means = np.zeros((len(langs),))
    exp_means = np.zeros((len(langs),))

    for lang_counter in range(len(langs)):
        bas_means[lang_counter] = get_lang_mean((baseline, "Bas"), langs[lang_counter], bas_runs, options)
        exp_means[lang_counter] = get_lang_mean((exp_name, "Exp"), langs[lang_counter], exp_runs, options)
        print("Gain: %.2f" % (exp_means[lang_counter] - bas_means[lang_counter]))

    bas_mean = np.mean(bas_means)
    exp_mean = np.mean(exp_means)
    print("Means across all %i languages: Bas %.2f, Exp %.2f, Gain %.2f" % (len(langs), bas_mean, exp_mean, exp_mean - bas_mean))

def get_lang_mean(exp_name, lang, no_runs, options):
    # exp_name is a (directory name, display label) pair
    if options.final_epochs:
        print("%s: mean of last %i epochs from %i runs for %s: " % (exp_name[1], options.no_epochs, no_runs, lang), end='')
    else:
        print("%s: mean of best %i epochs from %i runs for %s: " % (exp_name[1], options.no_epochs, no_runs, lang), end='')
    lang_means = np.zeros((no_runs,))
    for ind in range(1, no_runs + 1):  # loop over the runs of this experiment
        if ind == 1:
            scores_file = "./%s/%s/%s_scores.txt" % (exp_name[0], lang, lang)
        else:
            scores_file = "./%s-%i/%s/%s_scores.txt" % (exp_name[0], ind, lang, lang)
        scores = np.loadtxt(scores_file)
        if not options.final_epochs:
            scores = np.sort(scores)  # best epochs end up at the end of the array
        run_mean = np.mean(scores[-options.no_epochs:])
        lang_means[ind - 1] = run_mean
        print("%.2f " % run_mean, end='')
    lang_mean = np.mean(lang_means)
    print("(%.2f)" % lang_mean)
    return lang_mean

if __name__ == "__main__":

    parser = OptionParser()
    parser.add_option("--no-epochs", type="int", metavar="INTEGER", default=5, help='Number of epochs to use')
    parser.add_option("--final-epochs", action="store_true", default=False, help='Use final rather than best epochs')
    (options, args) = parser.parse_args()

    main(options, args)
21 changes: 0 additions & 21 deletions barchybrid/scripts/bash_script.sh

This file was deleted.

12 changes: 0 additions & 12 deletions barchybrid/scripts/best_res.sh

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/get_last_epoch.sh

This file was deleted.

84 changes: 0 additions & 84 deletions barchybrid/scripts/json_parser.py

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/parse.sh

This file was deleted.

13 changes: 0 additions & 13 deletions barchybrid/scripts/parse_multi_monoling.sh

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/parse_multiling_option.sh

This file was deleted.

13 changes: 0 additions & 13 deletions barchybrid/scripts/parse_surprise_languages.sh

This file was deleted.

