This repository was archived by the owner on Mar 19, 2021. It is now read-only.

Commit d9db8c9

init
0 parents  commit d9db8c9

29 files changed: 4071 additions, 0 deletions

.gitattributes

+63
@@ -0,0 +1,63 @@
###############################################################################
# Set default behavior to automatically normalize line endings.
###############################################################################
* text=auto

###############################################################################
# Set default behavior for command prompt diff.
#
# This is needed for earlier builds of msysgit that do not have it on by
# default for csharp files.
# Note: This is only used by command line
###############################################################################
#*.cs diff=csharp

###############################################################################
# Set the merge driver for project and solution files
#
# Merging from the command prompt will add diff markers to the files if there
# are conflicts (Merging from VS is not affected by the settings below, in VS
# the diff markers are never inserted). Diff markers may cause the following
# file extensions to fail to load in VS. An alternative would be to treat
# these files as binary and thus will always conflict and require user
# intervention with every merge. To do so, just uncomment the entries below
###############################################################################
#*.sln merge=binary
#*.csproj merge=binary
#*.vbproj merge=binary
#*.vcxproj merge=binary
#*.vcproj merge=binary
#*.dbproj merge=binary
#*.fsproj merge=binary
#*.lsproj merge=binary
#*.wixproj merge=binary
#*.modelproj merge=binary
#*.sqlproj merge=binary
#*.wwaproj merge=binary

###############################################################################
# behavior for image files
#
# image files are treated as binary by default.
###############################################################################
#*.jpg binary
#*.png binary
#*.gif binary

###############################################################################
# diff behavior for common document formats
#
# Convert binary document formats to text before diffing them. This feature
# is only available from the command line. Turn it on by uncommenting the
# entries below.
###############################################################################
#*.doc diff=astextplain
#*.DOC diff=astextplain
#*.docx diff=astextplain
#*.DOCX diff=astextplain
#*.dot diff=astextplain
#*.DOT diff=astextplain
#*.pdf diff=astextplain
#*.PDF diff=astextplain
#*.rtf diff=astextplain
#*.RTF diff=astextplain

.gitignore

+4
@@ -0,0 +1,4 @@
pred.txt
multi-bleu.perl
*.pt
*.pyc

LICENSE

+674
Large diffs are not rendered by default.

README.md

+98
@@ -0,0 +1,98 @@
# NQG
This repository contains code for the paper "[Neural Question Generation from Text: A Preliminary Study](https://arxiv.org/abs/1704.01792)".

## About this code

The experiments in the paper were done with an in-house deep learning tool. We therefore re-implemented the model in PyTorch as a reference.

This code implements only the `NQG+` setting from the paper.
After about one hour of training on a Tesla P100, the `NQG+` model reaches a BLEU-4 score of 12.35 on the dev set, as reported in our paper.

If you find this code useful in your research, please consider citing:

    @article{zhou2017neural,
      title={Neural Question Generation from Text: A Preliminary Study},
      author={Zhou, Qingyu and Yang, Nan and Wei, Furu and Tan, Chuanqi and Bao, Hangbo and Zhou, Ming},
      journal={arXiv preprint arXiv:1704.01792},
      year={2017}
    }



## How to run

### Prepare the dataset and code

Make an experiment home folder for the NQG data and code:
```bash
NQG_HOME=~/workspace/nqg
mkdir -p $NQG_HOME/code
mkdir -p $NQG_HOME/data
cd $NQG_HOME/code
git clone https://github.com/magic282/NQG.git
cd $NQG_HOME/data
wget https://res.qyzhou.me/redistribute.zip
unzip redistribute.zip
```
The unzipped data lives under `$NQG_HOME/data`; the overall layout should be:
```
nqg
├── code
│   └── NQG
│       └── seq2seq_pt
└── data
    └── redistribute
        ├── QG
        │   ├── dev
        │   ├── test
        │   ├── test_sample
        │   └── train
        └── raw
```
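
Before moving on, it can help to confirm that the folders match the tree above. The following is a minimal sketch, not part of this repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the sketch assumes every leaf in the tree is a directory.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the experiment home.
# Assumption: every leaf shown in the tree is a directory, not a file.
EXPECTED_DIRS = [
    "code/NQG/seq2seq_pt",
    "data/redistribute/QG/dev",
    "data/redistribute/QG/test",
    "data/redistribute/QG/test_sample",
    "data/redistribute/QG/train",
    "data/redistribute/raw",
]


def missing_dirs(nqg_home):
    """Return the expected sub-directories that do not exist under nqg_home."""
    home = Path(nqg_home).expanduser()
    return [d for d in EXPECTED_DIRS if not (home / d).is_dir()]
```

Calling `missing_dirs("~/workspace/nqg")` returns an empty list when the layout is complete, and otherwise lists the paths that still need to be created or unzipped.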
Then collect vocabularies:
```bash
python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
    $NQG_HOME/data/redistribute/QG/train/train.txt.source.txt \
    $NQG_HOME/data/redistribute/QG/train/train.txt.target.txt \
    $NQG_HOME/data/redistribute/QG/train/vocab.txt
python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
    $NQG_HOME/data/redistribute/QG/train/train.txt.bio \
    $NQG_HOME/data/redistribute/QG/train/bio.vocab.txt
python $NQG_HOME/code/NQG/seq2seq_pt/CollectVocab.py \
    $NQG_HOME/data/redistribute/QG/train/train.txt.pos \
    $NQG_HOME/data/redistribute/QG/train/train.txt.ner \
    $NQG_HOME/data/redistribute/QG/train/train.txt.case \
    $NQG_HOME/data/redistribute/QG/train/feat.vocab.txt
head -n 20000 $NQG_HOME/data/redistribute/QG/train/vocab.txt > $NQG_HOME/data/redistribute/QG/train/vocab.txt.20k
```
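
`CollectVocab.py` writes the special tokens first and then the remaining tokens in descending frequency, so the `head -n 20000` step keeps the four special tokens plus the 19,996 most frequent tokens. On a system without `head`, the same truncation can be sketched in Python. `truncate_vocab` is a hypothetical helper, not part of this repository:

```python
from itertools import islice


def truncate_vocab(src_path, dst_path, max_lines=20000):
    """Copy the first max_lines lines of a vocabulary file, like `head -n`.

    Returns the number of lines actually written, which is smaller than
    max_lines when the source file is shorter.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        kept = 0
        for line in islice(src, max_lines):
            dst.write(line)
            kept += 1
    return kept
```

For example, `truncate_vocab("vocab.txt", "vocab.txt.20k")` would produce the same file as the `head` command above, assuming both paths point into `$NQG_HOME/data/redistribute/QG/train`.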

### Set up the environment
#### Package Requirements:
```
nltk scipy numpy pytorch
```
**PyTorch version**: This code requires PyTorch v0.4.0.

**Python version**: This code requires Python 3.

**Warning**: Older versions of NLTK have a bug in the PorterStemmer. Therefore, a fresh installation or update of NLTK is recommended.

A Docker image is also provided.
#### Docker image
```bash
docker pull magic282/pytorch:0.4.0
```
### Run training
The file `run_squad_qg.sh` is an example. Modify it according to your configuration.
#### Without Docker
```bash
bash $NQG_HOME/code/NQG/seq2seq_pt/run_squad_qg.sh $NQG_HOME/data/redistribute/QG $NQG_HOME/code/NQG/seq2seq_pt
```
#### With Docker
```bash
nvidia-docker run --rm -ti -v $NQG_HOME:/workspace magic282/pytorch:0.4.0
```
Then, inside the container:
```bash
bash code/NQG/seq2seq_pt/run_squad_qg.sh /workspace/data/redistribute/QG /workspace/code/NQG/seq2seq_pt
```

seq2seq_pt/CollectVocab.py

+61
@@ -0,0 +1,61 @@
from __future__ import division
import sys
import operator

DefaultSpecialWords = ["<blank>", "<unk>", "<s>", "</s>"]


def Collect(inputFiles, vocabPath, toLower=False, userDefineSpecial=None):
    # Special tokens are written first so they receive the lowest indices.
    specialWords = []
    if userDefineSpecial:
        for item in userDefineSpecial:
            if item not in specialWords:
                specialWords.append(item)
    else:
        specialWords = DefaultSpecialWords

    counts = CollectVocab(inputFiles, toLower)
    total = sum(counts.values())
    sorted_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    acc = 0
    with open(vocabPath, 'w', encoding='utf-8') as sw:
        count = 0
        for item in specialWords:
            sw.write("{0} {1}\n".format(item, count))
            count += 1
        for k, v in sorted_counts:
            if k in specialWords:
                continue
            acc += v
            sw.write("{0} {1} {2} {3}\n".format(k, count, v, 1.0 * acc / total))
            count += 1


def CollectVocab(files, toLower):
    counts = {}
    for f in files:
        # Count whitespace-separated tokens across every input file.
        with open(f, encoding='utf-8') as sr:
            for line in sr:
                line = line.strip()
                if toLower:
                    line = line.lower()
                # split() with no arguments never yields empty tokens.
                sp = line.split()
                for token in sp:
                    if token not in counts:
                        counts[token] = 0
                    counts[token] += 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) >= 3:
        files = sys.argv[1:-1]
        vocab_file = sys.argv[-1]
        Collect(files, vocab_file, False, ["<blank>", "<unk>", "<s>", "</s>"])
    else:
        print('CollectVocab.py: Collect vocabulary from multiple files.')
        print('Usage:')
        print('python CollectVocab.py file_1 file_2 ... file_n out.vocab.txt')
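
The file written by `Collect` has two line shapes: `token index` for special tokens, and `token index count cumulative_fraction` for everything else. A minimal sketch of reading that format back is shown below. `load_vocab` and `coverage` are hypothetical helpers for illustration; the loader the training code actually uses is not part of the files shown in this commit view.

```python
def load_vocab(path, max_size=None):
    """Read a CollectVocab.py output file into a token -> index dict."""
    word2idx = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            token, index = fields[0], int(fields[1])
            if max_size is not None and index >= max_size:
                break  # indices are written in increasing order
            word2idx[token] = index
    return word2idx


def coverage(path, max_size):
    """Cumulative token fraction recorded for the first max_size entries."""
    covered = 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            # Only non-special lines carry the count and cumulative fraction.
            if len(fields) == 4 and int(fields[1]) < max_size:
                covered = float(fields[3])
    return covered
```

The fourth column makes it possible to judge a vocabulary cutoff before training: `coverage(path, 20000)` reports the share of training tokens that the 20k vocabulary would keep, as recorded by `Collect`.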

seq2seq_pt/PyBLEU/__init__.py

+4
@@ -0,0 +1,4 @@
from __future__ import absolute_import
from . import nltk_bleu_score  # relative import; assumes nltk_bleu_score.py sits in this package

__version__ = "0.0.1"
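
The contents of `nltk_bleu_score` are not shown in this commit view. As background for the BLEU-4 figure quoted in the README, the sketch below illustrates clipped (modified) n-gram precision, the quantity BLEU computes for each n-gram order before combining the orders and applying a brevity penalty. It is an illustrative stand-in under that assumption, not the package's implementation; `ngrams` and `modified_precision` are hypothetical names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    # A candidate n-gram is credited at most as often as a reference uses it.
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

Clipping is what stops a degenerate candidate from scoring well: a sentence made only of the word "the" gets credit for at most as many occurrences of "the" as appear in a single reference.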
