
Commit 98dc7c9

Put RNNG code under src
1 parent 07de8b5 commit 98dc7c9

175 files changed: +296842, -0 lines changed


src/.gitignore (+34)

@@ -0,0 +1,34 @@
build/*
EVALB

# Python cache files
*.pyc

# Compiled Object files
*.slo
*.lo
*.o
*.obj

# Precompiled Headers
*.gch
*.pch

# Compiled Dynamic libraries
*.so
*.dylib
*.dll

# Fortran module files
*.mod

# Compiled Static libraries
*.lai
*.la
*.a
*.lib

# Executables
*.exe
*.out
*.app

src/CMakeLists.txt (+28)

@@ -0,0 +1,28 @@
project(cnn)
cmake_minimum_required(VERSION 2.8 FATAL_ERROR)

set(CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake)
set(CMAKE_CXX_FLAGS "-Wall -std=c++11 -O3 -g")

enable_testing()

include_directories(${CMAKE_CURRENT_SOURCE_DIR})
include_directories(${CMAKE_CURRENT_SOURCE_DIR}/cnn)
set(WITH_EIGEN_BACKEND 1)

# look for Boost
set(Boost_REALPATH ON)
find_package(Boost COMPONENTS program_options iostreams serialization REQUIRED)
include_directories(${Boost_INCLUDE_DIR})
set(LIBS ${LIBS} ${Boost_LIBRARIES})

# look for Eigen
find_package(Eigen3 REQUIRED)
include_directories(${EIGEN3_INCLUDE_DIR})

#configure_file(${CMAKE_CURRENT_SOURCE_DIR}/config.h.cmake ${CMAKE_CURRENT_BINARY_DIR}/config.h)

add_subdirectory(cnn/cnn)
add_subdirectory(nt-parser)
# add_subdirectory(cnn/examples)

src/README.md (+122)

@@ -0,0 +1,122 @@
# NOTE
This code is originally an [implementation of Recurrent Neural Network Grammars from CMU](https://github.com/clab/rnng/). Before being put under this repository, the code was forked and modified in [this repository](http://github.com/kmkurn/rnng), so any changes from the original code can be viewed in the commit history.

# Recurrent Neural Network Grammars
Code for the [Recurrent Neural Network Grammars](https://arxiv.org/abs/1602.07776) paper (NAACL 2016), by Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith, after the Corrigendum (the last two pages of the arXiv version of the paper). The code is written in C++.

# Citation

    @inproceedings{dyer-rnng:16,
      author = {Chris Dyer and Adhiguna Kuncoro and Miguel Ballesteros and Noah A. Smith},
      title = {Recurrent Neural Network Grammars},
      booktitle = {Proc. of NAACL},
      year = {2016},
    }

# Prerequisites
* A C++ compiler supporting the [C++11 language standard](https://en.wikipedia.org/wiki/C%2B%2B11)
* [Boost](http://www.boost.org/) libraries
* [Eigen](http://eigen.tuxfamily.org) (latest development release)
* [CMake](http://www.cmake.org/)
* [EVALB](http://nlp.cs.nyu.edu/evalb/) (latest version. IMPORTANT: please put the EVALB folder in the same directory as `get_oracle.py` and `sample_input_chinese.txt` to ensure compatibility)

# Build instructions
Assuming the latest development version of Eigen is stored at /opt/tools/eigen-dev:

    mkdir build
    cd build
    cmake -DEIGEN3_INCLUDE_DIR=/opt/tools/eigen-dev ..
    make -j2

# Sample input format:
`sample_input_english.txt` (English PTB) and `sample_input_chinese.txt` (Chinese CTB)

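For reference, a single tree in PTB-style bracketed format looks like the following (an illustrative example, not a line taken from the sample files):

    (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))
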
# Oracles
The oracle converts a bracketed phrase-structure tree into a sequence of actions.
The script that produces the oracle also converts singletons in the training set, and unknown words in the dev and test sets, into a fine-grained set of 'UNK' symbols.

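To make the conversion concrete, the sketch below (illustrative Python only, not the project's `get_oracle.py`, and ignoring POS tags and UNK replacement) turns a bracketed tree into the top-down sequence of NT(X), SHIFT, and REDUCE actions used by the RNNG:

    # Illustrative sketch: convert a bracketed tree into top-down RNNG actions.
    # The real get_oracle.py also emits POS tags and fine-grained UNK symbols.

    def tokenize(tree):
        return tree.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        # Returns (label, children) for internal nodes, or the word for leaves.
        tok = tokens.pop(0)
        if tok == "(":
            label = tokens.pop(0)
            children = []
            while tokens[0] != ")":
                children.append(parse(tokens))
            tokens.pop(0)  # consume ")"
            return (label, children)
        return tok

    def actions(node):
        if isinstance(node, str):
            return ["SHIFT"]
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            return ["SHIFT"]  # preterminal (POS tag): just shift the word
        acts = ["NT(%s)" % label]
        for child in children:
            acts.extend(actions(child))
        acts.append("REDUCE")
        return acts

    tree = "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))"
    print(actions(parse(tokenize(tree))))
    # ['NT(S)', 'NT(NP)', 'SHIFT', 'SHIFT', 'REDUCE',
    #  'NT(VP)', 'SHIFT', 'REDUCE', 'SHIFT', 'REDUCE']
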
### Obtaining the oracle for the discriminative model

    python get_oracle.py [training file] [training file] > train.oracle
    python get_oracle.py [training file] [dev file] > dev.oracle
    python get_oracle.py [training file] [test file] > test.oracle

### Obtaining the oracle for the generative model

    python get_oracle_gen.py [training file] [training file] > train_gen.oracle
    python get_oracle_gen.py [training file] [dev file] > dev_gen.oracle
    python get_oracle_gen.py [training file] [test file] > test_gen.oracle

# Discriminative Model
The discriminative variant of the RNNG is used as a proposal distribution for decoding the generative model (although it can also be used for decoding on its own). To save time, we recommend training both models in parallel.

On the English PTB dataset, the discriminative model typically converges after about 3 days on a single-core CPU device.

### Training the discriminative model

    nohup build/nt-parser/nt-parser --cnn-mem 1700 -x -T [training_oracle_file] -d [dev_oracle_file] -C [original_dev_file (PTB bracketed format, see sample_input_english.txt)] -P -t --pretrained_dim [dimension of pre-trained word embedding] -w [pre-trained word embedding] --lstm_input_dim 128 --hidden_dim 128 -D 0.2 > log.txt

IMPORTANT: please run the command in the same folder where `remove_dev_unk.py` is located.

If not using pre-trained word embeddings, remove the `--pretrained_dim` and `-w` flags.

The training log is printed to `log.txt` (including information on where the parameter file for the model is saved, which is used for decoding with the -m option below).

### Decoding with discriminative model

    build/nt-parser/nt-parser --cnn-mem 1700 -x -T [training_oracle_file] -p [test_oracle_file] -C [original_test_file (PTB bracketed format, see sample_input_english.txt)] -P --pretrained_dim [dimension of pre-trained word embedding] -w [pre-trained word embedding] --lstm_input_dim 128 --hidden_dim 128 -m [parameter file] > output.txt

Note: the output will be stored in `/tmp/parse/parser_test_eval.xxxx.txt`, and the parser will output the F1 score calculated by EVALB with the COLLINS.prm option. The parameter file (following -m in the command above) can be obtained from `log.txt`.

If training was done using pre-trained word embeddings (the `-w` and `--pretrained_dim` options) or POS tags (the `-P` option), then decoding must also use exactly the same options used for training.

# Generative Model
The generative model achieved state-of-the-art results; decoding is done using trees sampled from the trained discriminative model.
For the best results, the generative model takes about 7 days to converge.

### Training the generative model

    nohup build/nt-parser/nt-parser-gen -x -T [training_oracle_generative] -d [dev_oracle_generative] -t --clusters clusters-train-berk.txt --input_dim 256 --lstm_input_dim 256 --hidden_dim 256 -D 0.3 > log_gen.txt

The training log is printed to `log_gen.txt`, including information on where the parameters of the model are saved, which is used for decoding later.

# Decoding with the generative model
Decoding with the generative model requires trees sampled from the trained discriminative model.

### Sampling trees from the discriminative model

    build/nt-parser/nt-parser --cnn-mem 1700 -x -T [training_oracle_file] -p [test_oracle_file] -C [original_test_file (PTB bracketed format, see sample_input_english.txt)] -P --pretrained_dim [dimension of pre-trained word embedding] -w [pre-trained word embedding] --lstm_input_dim 128 --hidden_dim 128 -m [parameter file of trained discriminative model] --alpha 0.8 -s 100 > test-samples.props

Important parameters:

* s = # of samples (all reported results used 100)
* alpha = posterior scaling (since this is a proposal, a higher-entropy distribution is probably good, so a value < 1 is sensible; all reported results used 0.8). See the sketch after this list for what the scaling does.

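As a rough illustration of the effect of alpha, assuming `--alpha` exponentiates and renormalizes the proposal probabilities before sampling (a minimal sketch, not the parser's actual code path):

    # Illustration only: alpha < 1 flattens the proposal distribution, so the
    # sampled trees cover more of the search space than greedy decoding would.
    def rescale(probs, alpha):
        scaled = [p ** alpha for p in probs]
        total = sum(scaled)
        return [s / total for s in scaled]

    print(rescale([0.70, 0.20, 0.10], 1.0))  # unchanged
    print(rescale([0.70, 0.20, 0.10], 0.8))  # flatter: more mass on the tail
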
### Prepare samples for likelihood evaluation

    utils/cut-corpus.pl 3 test-samples.props > test-samples.trees

### Evaluate joint likelihood under generative model

    build/nt-parser/nt-parser-gen -x -T [training_oracle_generative] --clusters clusters-train-berk.txt --input_dim 256 --lstm_input_dim 256 --hidden_dim 256 -p test-samples.trees -m [parameters file from the trained generative model, see log_gen.txt] > test-samples.likelihoods

### Estimate marginal likelihood (final step to get language modeling ppl)

    utils/is-estimate-marginal-llh.pl 2416 100 test-samples.props test-samples.likelihoods > llh.txt 2> rescored.trees

* 100 = # of samples
* 2416 = # of sentences in the test set
* `rescored.trees` will contain the samples reranked by p(x,y)

The file `llh.txt` will contain the final language modeling perplexity after marginalization (see the last lines of the file).

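For the arithmetic behind this step: with N sampled trees y_1..y_N per sentence x, drawn from the discriminative proposal q, the marginal likelihood is estimated by importance sampling as p(x) ≈ (1/N) Σ_i p(x, y_i) / q(y_i | x). A minimal sketch of that estimate in Python (not the actual `utils/is-estimate-marginal-llh.pl`, whose input format is not reproduced here):

    # Minimal sketch of the importance-sampling estimate of log p(x).
    # Inputs: per-sample log q(y_i | x) from the discriminative proposal and
    # log p(x, y_i) from the generative model.
    import math

    def log_marginal(log_joint, log_proposal):
        # log of the average of p(x, y_i) / q(y_i | x), via log-sum-exp.
        ratios = [lj - lq for lj, lq in zip(log_joint, log_proposal)]
        m = max(ratios)
        return m + math.log(sum(math.exp(r - m) for r in ratios)) - math.log(len(ratios))

    def perplexity(per_sentence_log_probs, total_word_count):
        # Language-modeling perplexity from the per-sentence log p(x) estimates.
        return math.exp(-sum(per_sentence_log_probs) / total_word_count)
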
### Compute generative model parsing accuracy (final step to get parsing accuracy from the generative model)

    utils/add-fake-preterms-for-eval.pl rescored.trees > rescored.preterm.trees
    utils/replace-unks-in-trees.pl [Discriminative oracle for the test file] rescored.preterm.trees > hyp.trees
    utils/remove_dev_unk.py [gold trees on the test set (same format as sample_input_english.txt)] hyp.trees > hyp_final.trees
    EVALB/evalb -p COLLINS.prm [gold trees on the test set (same format as sample_input_english.txt)] hyp_final.trees > parsing_result.txt

The file `parsing_result.txt` contains the final parsing accuracy computed with EVALB.

# Contact
If there are any issues, please let us know at adhiguna.kuncoro [ AT SYMBOL ] gmail.com, miguel.ballesteros [AT SYMBOL] ibm.com, and cdyer [AT SYMBOL] cs.cmu.edu.

# License
This software is released under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
