
Commit 530e5e2

some more refactor + readded many options + better hyperparameter defaults
mdelhoneux committed Aug 2, 2018
1 parent acbd1a3 commit 530e5e2
Showing 18 changed files with 1,005 additions and 761 deletions.
69 changes: 43 additions & 26 deletions README.md
# UUParser: A transition-based dependency parser for Universal Dependencies

This parser is based on [Eli Kiperwasser's transition-based parser](http://github.com/elikip/bist-parser) using BiLSTM feature extractors.
We adapted the parser to Universal Dependencies and extended it as described in these papers:

* (Version 1.0) Adaptation to UD + removed POS tags from the input + added character vectors + use of pseudo-projective parsing:
>Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From Raw Text to Universal Dependencies - Look, No Tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

* (Version 2.0) Removed the need for pseudo-projective parsing by using a swap transition and creating a partially dynamic oracle as described in:
>Miryam de Lhoneux, Sara Stymne and Joakim Nivre. 2017. Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle. In Proceedings of the The 15th International Conference on Parsing Technologies (IWPT).

* (Version 2.3) Added POS tags back in, extended cross-treebank functionality and the use of external embeddings, and tuned default hyperparameters:

>Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao and Sara Stymne. 2018. 82 Treebanks, 34 Models: Universal Dependency Parsing with Cross-Treebank Models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
The techniques behind the original parser are described in the paper [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://www.transacl.org/ojs/index.php/tacl/article/viewFile/885/198).

#### Required software

* Python 2.7 interpreter
* [DyNet library](https://github.com/clab/dynet/tree/master/python)

Note: the current version uses DyNet 2.0, but DyNet 1.0 was used in both releases 1.0 and 2.0.
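
If a prebuilt wheel is available for your platform, DyNet can usually be installed with pip; otherwise see the DyNet repository linked above for build instructions:

```
pip install dynet
```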


#### Train a parsing model

To train parsing models for a set of treebanks:

    python src/parser.py --outdir [results directory] --datadir [directory of UD files with the structure UD\_\*\*/iso\_id-ud-train/dev.conllu] --include [treebanks to include denoted by their ISO id]

#### Options

The parser has numerous options that allow you to fine-tune its behaviour. For a full list, type:

    python src/parser.py --help

We recommend you set the --dynet-mem option to a larger number when running the full training procedure on larger treebanks.
Commonly used values are 5000 and 10000 (in MB).

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.).
To ensure identical results between two runs, we recommend setting the --dynet-seed option to the same value both times (e.g. --dynet-seed 123456789).
This ensures that Python's random number generator and DyNet both produce the same sequence of random numbers.
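
As a quick sketch (treebank and directory names here are illustrative), two runs like the following should therefore produce identical models and dev scores:

```
python src/parser.py --outdir run1 --datadir ud-treebanks-v2.0 --include "sv" --dynet-seed 123456789 --dynet-mem 5000
python src/parser.py --outdir run2 --datadir ud-treebanks-v2.0 --include "sv" --dynet-seed 123456789 --dynet-mem 5000
```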

#### Example

The following is a typical command for training separate models for UD_Swedish, UD_Russian, and UD_English:

    python src/parser.py --outdir my_output --datadir ud-treebanks-v2.0 --include "sv ru en" --dynet-seed 123456789 --dynet-mem 10000

The output files will be created in my_output/sv, my_output/ru, and my_output/en.
This command assumes that the directory UD_Swedish exists in ud-treebanks-v2.0 and contains at least the file sv-ud-train.conllu (and the same for the other two languages).
If dev data is also found (sv-ud-dev.conllu), model selection will be performed by default by parsing the dev data at each epoch and choosing the model from the epoch with the highest LAS.

#### Pick a model

Training saves one model per epoch and evaluates it on the dev set.
For prediction, the parser expects a model directory with one model and a parameter file for each treebank in a subdirectory, e.g. models/en/barchybrid.model and models/en/params.pickle.
Before using your models for prediction, you probably want to pick one model per treebank. You can do this manually by looking at performance on the dev sets and copying the model and parameter files.
Alternatively, you can use a script to do it for many treebanks at a time. The script expects a file with one ISO code per line and a directory of trained models with evaluations on the dev sets. It creates a directory called 'models' and copies the best-performing model on each dev set as measured by LAS. If models are found but no evaluation (some treebanks do not have a dev set), it picks the last epoch trained.

```
python scripts/pick_model.py iso_codes.txt dir_with_models
```
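
For reference, here is a sketch of the --datadir layout assumed by the training example above (treebank and file names are illustrative and follow UD v2.0 naming conventions):

```
ud-treebanks-v2.0/
├── UD_Swedish/
│   ├── sv-ud-train.conllu
│   └── sv-ud-dev.conllu
├── UD_Russian/
│   ├── ru-ud-train.conllu
│   └── ru-ud-dev.conllu
└── UD_English/
    ├── en-ud-train.conllu
    └── en-ud-dev.conllu
```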

#### Parse data with your parsing model

    python src/parser.py --predict --outdir [results directory] --datadir [directory of UD files with the structure UD\_\*\*/iso\_id-ud-train/dev.conllu] --include [treebanks to include denoted by their ISO id]

By default this will parse the dev data for the specified treebanks with the model files (by default barchybrid.model) found in treebank-specific subdirectories of outdir, and will store the resulting .conllu files in the output directory (--outdir).
Note that if you don't want to use the same directory for model files and output files, you can specify --modeldir explicitly.
By default, it is assumed that --modeldir is the same as --outdir.
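
For example, to parse the Swedish dev data with the model trained in the example above (directory names are again illustrative):

```
python src/parser.py --predict --outdir my_output --datadir ud-treebanks-v2.0 --include "sv"
```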

#### Multi-treebank models

An important feature of the parser is the ability to train cross-treebank models by adding a treebank embedding.
This technique is described in detail in:

>Sara Stymne, Miryam de Lhoneux, Aaron Smith and Joakim Nivre. 2018. Parser Training with Heterogeneous Treebanks. In Proceedings of ACL.

To train a multi-treebank model, simply add the --multiling flag at both training and test time.
The output model files will be stored by default directly in the specified output directory rather than in treebank-specific subdirectories.
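
A minimal sketch (treebank IDs and the output directory are illustrative): train a single model on three treebanks, then use it for prediction:

```
python src/parser.py --outdir multi_output --datadir ud-treebanks-v2.0 --include "sv ru en" --multiling
python src/parser.py --predict --outdir multi_output --datadir ud-treebanks-v2.0 --include "sv ru en" --multiling
```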

#### Citation

If you make use of this software for research purposes, we would appreciate it if you cite the following:

If you use version 2.0 or later:

@InProceedings{delhoneux17arc,
author = {Miryam de Lhoneux and Sara Stymne and Joakim Nivre},
64 changes: 64 additions & 0 deletions barchybrid/scripts/analysis_multimono.py
from __future__ import print_function
import numpy as np
import sys
from optparse import OptionParser

# Example usage: python analysis_multimono.py baselines_v2 3 byte_model 1 ar en fi
# Finds the average improvement on the best dev epochs of the second experiment
# over the first for the listed languages. The number after each experiment name
# is the number of repeats of that experiment: in this case the script expects
# to also find baselines_v2-2 and baselines_v2-3, and it uses all runs to
# calculate averages.

def main(options, args):

    np.set_printoptions(precision=2)

    # If the experiment/baseline name is e.g. baselines_v3, multiple runs need
    # to be stored as baselines_v3-2, baselines_v3-3, etc.
    baseline = args[0]       # name of the baseline experiment
    bas_runs = int(args[1])  # number of baseline runs
    exp_name = args[2]       # name of the new experiment
    exp_runs = int(args[3])  # number of new-experiment runs
    print("Results for experiment: " + exp_name)

    langs = args[4:]
    bas_means = np.zeros((len(langs),))
    exp_means = np.zeros((len(langs),))

    for lang_counter in range(len(langs)):
        bas_means[lang_counter] = get_lang_mean((baseline, "Bas"), langs[lang_counter], bas_runs, options)
        exp_means[lang_counter] = get_lang_mean((exp_name, "Exp"), langs[lang_counter], exp_runs, options)
        print("Gain: %.2f" % (exp_means[lang_counter] - bas_means[lang_counter]))

    bas_mean = np.mean(bas_means)
    exp_mean = np.mean(exp_means)
    print("Means across all %i languages: Bas %.2f, Exp %.2f, Gain %.2f" % (len(langs), bas_mean, exp_mean, exp_mean - bas_mean))

def get_lang_mean(exp_name, lang, no_runs, options):
    # exp_name is a (directory name, display label) pair
    if options.final_epochs:
        print("%s: mean of last %i epochs from %i runs for %s: " % (exp_name[1], options.no_epochs, no_runs, lang), end='')
    else:
        print("%s: mean of best %i epochs from %i runs for %s: " % (exp_name[1], options.no_epochs, no_runs, lang), end='')
    lang_means = np.zeros((no_runs,))
    for ind in range(1, no_runs + 1):  # loop over the runs of this experiment
        if ind == 1:
            scores_file = "./%s/%s/%s_scores.txt" % (exp_name[0], lang, lang)
        else:
            scores_file = "./%s-%i/%s/%s_scores.txt" % (exp_name[0], ind, lang, lang)
        scores = np.loadtxt(scores_file)
        if not options.final_epochs:
            scores = np.sort(scores)  # best epochs end up at the end of the array
        run_mean = np.mean(scores[-options.no_epochs:])
        lang_means[ind - 1] = run_mean
        print("%.2f " % run_mean, end='')
    lang_mean = np.mean(lang_means)
    print("(%.2f)" % lang_mean)
    return lang_mean

if __name__ == "__main__":

    parser = OptionParser()
    parser.add_option("--no-epochs", type="int", metavar="INTEGER", default=5, help='Number of epochs to use')
    parser.add_option("--final-epochs", action="store_true", default=False, help='Use final rather than best epochs')
    (options, args) = parser.parse_args()

    main(options, args)
21 changes: 0 additions & 21 deletions barchybrid/scripts/bash_script.sh

This file was deleted.

12 changes: 0 additions & 12 deletions barchybrid/scripts/best_res.sh

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/get_last_epoch.sh

This file was deleted.

84 changes: 0 additions & 84 deletions barchybrid/scripts/json_parser.py

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/parse.sh

This file was deleted.

13 changes: 0 additions & 13 deletions barchybrid/scripts/parse_multi_monoling.sh

This file was deleted.

10 changes: 0 additions & 10 deletions barchybrid/scripts/parse_multiling_option.sh

This file was deleted.

13 changes: 0 additions & 13 deletions barchybrid/scripts/parse_surprise_languages.sh

This file was deleted.

