GitHub - annirvine/LIMTresults: Lexicon Induction & MT Results

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
byFreqGraphs		byFreqGraphs
byFreqMLGraphs		byFreqMLGraphs
techreport		techreport
.DS_Store		.DS_Store
.gitignore		.gitignore
README		README

Repository files navigation

Easy processing pipelines done up to the to-do line below. Note that not all output exists yet, but the jobs are running for all languages. All done for "az" example language.

A. byFreqGraphs:

Evaluation of lexicon induction framework for 24 languages (intersection of MTurk annotated Wikipedia dictionaries, with >100 translated words,  and our monolingual crawls)

By-language directory for each:

1. lang/lexinductnew.pdf : plot of frequency vs. accuracy in top-1, top-10, and top-100 outputs. If any MTurk worker has annotated a target word as a translation of a source language word, then it is considered correct. Translations are ranked using the mean reciprocal rank of the rankings given by our (1) contextual (all available dictionaries used for projection), (2) temporal, and (3) edit distance (if non-roman script language, words transliterated first) monolingual signals. Additional plots for each signal individually can be easily added. The 3000 most frequent words (wrt monolingual crawls) in our test dictionary were divided into 15 bins, by frequency. The average frequency of the words in each bin is plotted vs. the average translation accuracy of the words in each bin. If fewer than 3000 source language dictionary words appeared in the monolingual data, then the entire intersection of the two was used. The average corpus frequency (x-axis on the graphs) varies because (1) the test set size varied, as described, and (2) the size of the monolingual corpora varies.

2. By-Frequency translation dumps: For each of three runs (which are averaged for the graph above): on COE grid: /home/hltcoe/airvine/inducePhraseTable/LIMT/Experiments/LexInductExps/lang/runx/{aggmrr,context,edit,time}.scored contains a dump of the top 500 translations for all source language words translated in (1). Correct translations are starred. [Not included in github repo because filesizes are very large.]

----Not updated on COE cluster, not included in new paper draft
3. Summary statistic (accuracy in top k using one or aggregate of three monolingual signals) for *all* source language words in the MTurk dictionaries (regardless of frequency) in lang/summarystats
Note: dumps of translations for all source words available on a01:
/mnt/data/anni/Experiments/TranslateOOVs/$lang$/dictall/output
Correct translations are starred.


B. MachineTranslation

1a. Top-k translations for all OOV words wrt 5 Wikipedia articles (same method as above), in /mnt/data/anni/Experiments/TranslateOOVs/$lang$/dictplusoov/output
1b. Script to extract top-50 translations for OOVS only, from (1a) output above, output in /mnt/data/anni/Experiments/TranslateOOVs/$lang$/oovs.translated, ready to be added to scoreless phrase table (below)

2. End-to-End translations for each language (w/ OOVs translated using above lex induc framework)

3. Experiments with an MT test set: Wikipedia Indian languages (Spanish + Urdu if have time, but that monolingual data is so big)
(1) MT learning curve: From 100 -> all of training data available, doubling amount of training data at each point on the curve. 
(2) Same as above, plus monolingual scores.
(3) Pure monolingual system; monolingual scores on phrase tables including:
  - Using dictionaries only

----TO-DO line----

  - Using dictionaries + induced oov translations from (a) crawls, (b) wikipedia
(4) Full training data from (1), plus monolingual scores from (2), plus dictionaries and oov translations from (3) supplemented to training data. Expected to be best performing system. Could supplment dictionary and monolingual scores to varying amounts of training data, if time.


C. Documentation

1. Clean up my notes on running pipelines for producing the above, and include here
2. Make latex tables with data information (dictionary sizes, monolingual crawl data sizes, etc.) in "Babel Final Report Results" google spreadsheet.
3. Write some prose about experimental results

----

Additional Experiments:
1. Tune weights on contextual, temporal, orthographic, and wikipedia topic features and combine optimally, instead of just MRR. Training data - part of gold standard dictionary and scored features and correct and incorrect translation pairs
2. Vary amount of monolingual text, e.g. Farsi or Spanish, used to score dictionary translations.
3. Add transliteration (buckwalter or full) for non-roman script languages - DONE
4. Score stems of phrase table entries
5. Training side of Indian language corpus LMs from Matt
6. keep evaluation set the same, use full for contextual vector projection - DONE
7. scatter plot: amount of monolingual data vs. lexicon induction performance (e.g. fixed point at corpus freq = 1k and top 10 vs millions of monolingual tokens) - DONE
8. oovs in 5 representative wikipedia pages  - add wikipedia source pages with dummy time stamps to monolingual web crawl data (at least will get rid of all oovs, even if induced translations are nonsense) - DONE
9. learning curve over spanish monolingual data
10. Section 5 to add:  (1) dictionary PT translation w/o mono scores (2) learning curve for blue, green, red line (start with dictionary, then append training at each step)