Transliteration training and decoding with sample Urdu data
1. If using a Mac, update the SRILM call in train_helper (comment out line 72, uncomment line 73).
If using the CLSP or COE clusters, do nothing.
2. Set the JOSHUA, SRILM, and JAVA_HOME environment variables appropriately (or use my pointers at the top of train_helper and test_helper for COE machines).
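For example (the paths are placeholders; substitute your own installation locations):

export JOSHUA=/path/to/joshua
export SRILM=/path/to/srilm
export JAVA_HOME=/path/to/java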
3. To train a system, create a directory containing a "train" file and a "dev" file of transliteration pairs, then run (pointing to the training directory):
./train_main.sh demo_data/urdu
An example directory is demo_data/urdu (note that its train and dev files already have beginning-of-word and end-of-word symbols appended).
4. To decode a test set, run:
./translit_strings.sh demo_data/urdu/urdu_test demo_data/urdu
First arg: the test file (a list of single words to decode)
Second arg: the same directory used in the training step above
Output is written to demo_data/urdu/urdu_test.answer
For this example, references are provided in demo_data/urdu/urdu_test.ref for comparison.
5. Evaluate your output:
python evaluate.py demo_data/urdu/urdu_test.answer demo_data/urdu/urdu_test.ref
This simple evaluation script reports the number and percentage of perfect transliterations, the average edit distance, and the average edit distance normalized by the length of the reference.
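For reference, here is a minimal sketch of the metrics being reported (illustrative only; evaluate.py is the actual implementation):

# Levenshtein edit distance via dynamic programming (illustrative sketch).
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Per-pair scores: exact match, edit distance, and edit distance
# normalized by the reference length.
def score_pair(answer, reference):
    d = edit_distance(answer, reference)
    return answer == reference, d, d / len(reference)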
-------------------------------------
Using your own data
train/dev data:
To use your own data, create a directory containing a train file and a dev file. A script to append beginning-of-word and end-of-word symbols is provided: addSymTrainDev.py (run: python addSymTrainDev.py inputTrain).
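A rough sketch of what that preprocessing step does (the actual boundary symbols and file format are defined by addSymTrainDev.py; the markers and tab-separated layout below are assumptions):

# Illustrative only: assumes one tab-separated source/target pair per line
# and hypothetical <w> ... </w> boundary symbols.
import sys

BOW, EOW = "<w>", "</w>"

for line in open(sys.argv[1]):
    src, tgt = line.rstrip("\n").split("\t")
    print("%s %s %s\t%s %s %s" % (BOW, src, EOW, BOW, tgt, EOW))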
language model data:
Two English language model files are included in demo_data/lm
One is taken from Wikipedia person-page titles and the other from the NE-labeled Gigaword corpus; each includes relative frequency counts.
train your own LM:
You can use a different LM: just change the pointer in train_main.sh (a script for adding beginning-of-word and end-of-word symbols is in demo_data/lm).
to build an LM from a monolingual corpus:
cd demo_data/lm
python make_lm.py my-input-text
Then change the LM pointer in train_main.sh to my-input-text.wordModel.
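A sketch of what such a word model plausibly contains, given that the included LM files store relative frequency counts (the real make_lm.py may tokenize and format its output differently):

# Illustrative only: builds a word -> relative frequency table.
import sys
from collections import Counter

counts = Counter()
for line in open(sys.argv[1]):
    counts.update(line.split())

total = float(sum(counts.values()))
with open(sys.argv[1] + ".wordModel", "w") as out:
    for word, n in counts.most_common():
        out.write("%s\t%g\n" % (word, n / total))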
-------------------------------------
Using Wikipedia data:
See wikipedia_data/README
-------------------------------------
Substitute transliterations into Joshua MT output
To find OOVs in the Joshua MT output, use find-oovs.pl
After generating transliterations for all OOV words, create a dictionary file (OOV word, tab, transliterated word) and substitute them into the Joshua output:
python sub-oovs.py tab-separated-oov-dictionary translation-output-file
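A minimal sketch of that substitution step (sub-oovs.py is the actual script; whitespace tokenization is assumed here):

# Illustrative only: replace tokens found in the OOV dictionary.
import sys

# oov word <tab> transliterated word, one pair per line
oov = dict(line.rstrip("\n").split("\t") for line in open(sys.argv[1]))

for line in open(sys.argv[2]):
    print(" ".join(oov.get(tok, tok) for tok in line.split()))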
-------------------------------------
If you use this system, please cite the following:
http://www.clsp.jhu.edu/~anni/papers/irvine_amta_translit.pdf
-------------------------------------
Questions: