## MT System Rankings for WMT20 English-Inuktitut

This repository contains code and data for replicating rankings and correlations in the paper [Test Set Sampling Affects System Rankings: Expanded Human Evaluation of WMT20 English-Inuktitut Systems](https://www.statmt.org/wmt22/pdf/2022.wmt-1.8.pdf) by Rebecca Knowles and Chi-kiu Lo.

## Requirements

This code relies on [SciPy](https://www.scipy.org), [NumPy](https://numpy.org/), and [MT Metrics Eval](https://github.com/google-research/mt-metrics-eval). Please follow each library's installation instructions. For MT Metrics Eval, please also follow its instructions for downloading the database.
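As a quick sanity check that the dependencies are importable (a minimal sketch; the `mt_metrics_eval` import name is an assumption based on that repository's layout):

```python
# Minimal environment check for the dependencies listed above.
# The mt_metrics_eval import name is assumed from that repository's package layout.
import numpy
import scipy

import mt_metrics_eval  # noqa: F401  (only checking that the package is importable)

print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)  # see the note on SciPy versions under "System Rankings" below
```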

## Data

The data in this repository includes News data annotations (already publicly available from [WMT20](https://www.statmt.org/wmt20/results.html)) as well as additional annotations of Hansard data.
For details, see the [DATA-README](data/DATA-README.md) (also included below).

## Code

### System Rankings

To reproduce the system rankings shown in Tables 3 and 4 of the paper, run `scripts/generate_rankings.sh`. The result should match the contents of the file `scripts/example_rankings.txt`. To match the clusterings exactly, you may need to use SciPy version <=1.6.2 (note that this differs from the metric correlations code, which may run with newer versions of SciPy).

### Metric Correlations

To reproduce the metric correlations shown in Tables 6 and 7 of the paper, run `scripts/generate_correlations.sh`. The results should match the contents of the files `scripts/*.pearson` and `scripts/*.kendall`.

## Copyright

Multilingual Text Processing / Traitement multilingue de textes

Digital Technologies Research Centre / Centre de recherche en technologies numériques

National Research Council Canada / Conseil national de recherches Canada

Copyright 2022, Sa Majesté le Roi du Chef du Canada / His Majesty the King in Right of Canada

Published under the GPL v3.0 License (see [LICENSE](LICENSE)).

## Cite

If you use this code, you may wish to cite:

```
@InProceedings{knowles-lo:2022:WMT,
  author    = {Knowles, Rebecca and Lo, Chi-kiu},
  title     = {Test Set Sampling Affects System Rankings: Expanded Human Evaluation of {WMT20} {E}nglish-{I}nuktitut Systems},
  booktitle = {Proceedings of the Seventh Conference on Machine Translation},
  month     = {December},
  year      = {2022},
  address   = {Abu Dhabi},
  publisher = {Association for Computational Linguistics},
  pages     = {140--153},
  url       = {https://aclanthology.org/2022.wmt-1.8}
}
```
## CSV Data Files

The columns in the *.csv data files are as follows (a minimal reading sketch follows the list):

`annotator,hitid,system,segid,itemtype,src,tgt,score,docid,docscore,start,end`

- **annotator**: anonymous ID corresponding to a specific annotator
- **hitid**: ID for the specific HIT
- **system**: MT system
- **segid**: 0-indexed segment ID (0 is the first segment in a document; in a document of n segments, segid n corresponds to the document-level score)
- **itemtype**: TGT (valid score) or BAD (used for QA); see the discussion below on how this interacts with docscores
- **src**: source language
- **tgt**: target language
- **score**: score (0-100) for the segment
- **docid**: document ID
- **docscore**: False (indicates a segment-level score) or True (indicates a document-level score)
- **start**: start time
- **end**: end time
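
As a rough illustration of the layout (the file name below is hypothetical, and the sketch assumes each file begins with the header row shown above):

```python
import csv

# Hypothetical file name; substitute any *.csv file from this directory.
# Assumes the file starts with the header row listed above; if not, pass
# fieldnames=[...] to DictReader explicitly.
with open("annotations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # docscore is stored as the strings "True" (document-level) / "False" (segment-level).
        level = "doc" if row["docscore"] == "True" else "seg"
        print(level, row["system"], row["docid"], row["segid"], row["score"])
```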

## TSV Segment ID Mapping File

The tab-separated columns in the segment_id_mapping.tsv file are (a small lookup sketch follows the list):

- **domain**: news or hansard
- **original document ID**: as found in the original test set data release (sgm files)
- **original segment ID**: 1-indexed segment ID as found in the original test set data release (sgm files)
- **final document ID**: document ID used in the Appraise output (csv files in this directory); note that Hansard is split into pseudo-documents
- **final segment ID**: 0-indexed segment ID; matches segment IDs in Appraise output csv files
- **metrics segment ID**: 0-indexed segment ID used for the metrics task (segment ID according to segment order in the original sgm files)
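
For example, a lookup from Appraise IDs back to the original sgm IDs might be built along these lines (a sketch; whether the file includes a header row is an assumption to check):

```python
import csv

# Map (final document ID, final segment ID) from the Appraise csv files back to the
# original sgm document/segment IDs and the metrics segment ID.
# If segment_id_mapping.tsv starts with a header row, skip it with next(reader) first.
mapping = {}
with open("segment_id_mapping.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for domain, orig_doc, orig_seg, final_doc, final_seg, metrics_seg in reader:
        mapping[(final_doc, final_seg)] = {
            "domain": domain,
            "original_doc": orig_doc,
            "original_seg": orig_seg,
            "metrics_seg": metrics_seg,
        }
```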

## Document-level scores and itemtype

All csv files contain both segment-level and document-level scores, with the latter indicated by a value of True in the docscore column. Both are collected in the same interface: the segment-level scores are produced within document context, and the document-level score is entered after all of a document's segment-level scores are complete.

We use only segment-level scores (docscore: False) in our work, but we provide all the data here for completeness.

The Hansard data contains no quality assurance/quality control items, so all segment-level and document-level scores are labeled TGT.

The News data does contain QA/QC items. At the segment level, these are labeled with itemtype BAD. No document-level scores are labeled BAD. However, an examination of the document-level scores suggests that this may be a mislabeling: some document-level scores may be scores for documents containing BAD references (corrupted MT output whose scores, compared against those for the uncorrupted output, were used to remove annotations through QA/QC processes), and thus should be labeled BAD. We do not have access to the raw BAD reference text data to confirm this. We strongly urge caution in using any of the document-level News scores. A small filtering sketch follows.
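
One possible way to select the scores described above (segment-level rows labeled TGT; treating TGT-only as the validity criterion is an assumption based on the column description):

```python
import csv

def valid_segment_scores(path):
    """Yield segment-level scores (docscore False) with itemtype TGT from one annotation csv."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["docscore"] == "False" and row["itemtype"] == "TGT":
                yield row["system"], row["docid"], int(row["segid"]), float(row["score"])

# Example usage (hypothetical file name):
# for system, docid, segid, score in valid_segment_scores("annotations.csv"):
#     ...
```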