Initial commit

rlknowles committed Dec 2, 2022
0 parents, commit 3c0316e
Showing 23 changed files with 45,517 additions and 0 deletions.
674 changes: 674 additions & 0 deletions LICENSE

45 changes: 45 additions & 0 deletions README.md
@@ -0,0 +1,45 @@
## MT System Rankings for WMT20 English-Inuktitut
This repository contains code and data for replicating rankings and correlations in the paper [Test Set Sampling Affects System Rankings: Expanded Human Evaluation of WMT20 English-Inuktitut Systems](https://www.statmt.org/wmt22/pdf/2022.wmt-1.8.pdf) by Rebecca Knowles and Chi-kiu Lo.

## Requirements
This code relies on [SciPy](https://www.scipy.org), [NumPy](https://numpy.org/), and [MT Metrics Eval](https://github.com/google-research/mt-metrics-eval). Please follow each library's instructions for installation. For MT Metrics Eval, please also follow the instructions to download the database.

## Data
The data in this repository includes News data annotations (already publicly available from [WMT20](https://www.statmt.org/wmt20/results.html)) as well as additional annotations of Hansard data.
For details, see the [DATA-README](data/DATA-README.md).

## Code

### System Rankings
To reproduce the system rankings shown in Tables 3 and 4 of the paper, run `scripts/generate_rankings.sh`. The result should match the contents of the file `scripts/example_rankings.txt`. To match the clusterings exactly, you may need to use SciPy version <=1.6.2 (note that this differs from the metric correlations code, which may run with newer versions of SciPy).
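
As a quick sanity check before running the ranking script, you can confirm the installed SciPy version from Python. This snippet is illustrative only and not part of the repository:

```python
# Warn if the installed SciPy is newer than 1.6.2, since clusterings
# reproduced with a newer version may differ from the paper.
# (Pre-release version strings are not handled by this simple parse.)
from importlib.metadata import version

installed = tuple(int(p) for p in version("scipy").split(".")[:3])
if installed > (1, 6, 2):
    print(f"SciPy {version('scipy')} detected; "
          "use <=1.6.2 to match the paper's clusterings exactly.")
```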

### Metric Correlations
To reproduce the metric correlations shown in Tables 6 and 7 of the paper, run `scripts/generate_correlations.sh`. The result should match the contents of the files `scripts/*.pearson` and `scripts/*.kendall`.
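
For intuition, the reported numbers are standard Pearson and Kendall correlation statistics between human and automatic scores. A minimal sketch with made-up values (the actual scripts compute these from the repository data via MT Metrics Eval):

```python
# Hypothetical system-level scores; the real inputs come from the data files.
from scipy.stats import pearsonr, kendalltau

human_scores  = [72.1, 65.4, 58.9, 80.3, 61.0]  # e.g., human evaluation scores per system
metric_scores = [0.71, 0.63, 0.60, 0.79, 0.58]  # e.g., automatic metric scores per system

pearson_r, _ = pearsonr(human_scores, metric_scores)
kendall_tau, _ = kendalltau(human_scores, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Kendall tau = {kendall_tau:.3f}")
```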

## Copyright
Multilingual Text Processing / Traitement multilingue de textes

Digital Technologies Research Centre / Centre de recherche en technologies numériques

National Research Council Canada / Conseil national de recherches Canada

Copyright 2022, Sa Majesté le Roi du chef du Canada / His Majesty the King in Right of Canada

Published under the GPL v3.0 License (see [LICENSE](LICENSE)).

## Cite
If you use this code, you may wish to cite:

```
@InProceedings{knowles-lo:2022:WMT,
  author    = {Knowles, Rebecca and Lo, Chi-kiu},
  title     = {Test Set Sampling Affects System Rankings: Expanded Human Evaluation of {WMT20} {E}nglish-{I}nuktitut Systems},
  booktitle = {Proceedings of the Seventh Conference on Machine Translation},
  month     = {December},
  year      = {2022},
  address   = {Abu Dhabi},
  publisher = {Association for Computational Linguistics},
  pages     = {140--153},
  url       = {https://aclanthology.org/2022.wmt-1.8}
}
```
38 changes: 38 additions & 0 deletions data/DATA-README.md
@@ -0,0 +1,38 @@
## CSV Data Files
The columns in the `*.csv` data files are as follows (a minimal reading sketch follows the list):

`annotator,hitid,system,segid,itemtype,src,tgt,score,docid,docscore,start,end`

- **annotator**: anonymous ID corresponding to a specific annotator
- **hitid**: ID for the specific HIT
- **system**: MT system
- **segid**: 0-indexed segment ID (0 is the first segment in a document; in a document of n segments, segid n corresponds to the document-level score)
- **itemtype**: TGT (valid score) or BAD (used for QA); see discussion below on how this interacts with docscores
- **src**: source language
- **tgt**: target language
- **score**: score (0-100) for the segment
- **docid**: document ID
- **docscore**: False (indicates segment-level score) or True (indicates document-level score)
- **start**: start time
- **end**: end time
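
A minimal sketch for reading these files, assuming each file begins with a header row naming the columns above (the filename is a placeholder for any csv file in this directory):

```python
import csv
from collections import Counter

# "news.csv" is a placeholder filename; use any *.csv file in this directory.
with open("news.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))  # assumes a header row with the column names above

print(Counter(row["system"] for row in rows))  # number of rows per MT system
print(rows[0]["annotator"], rows[0]["score"])  # field access by column name
```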


## TSV Segment ID Mapping File
The tab-separated columns in the segment_id_mapping.tsv file are as follows (a loading sketch follows the list):

- **domain**: news or hansard
- **original document ID**: as found in the original test set data release (sgm files)
- **original segment ID**: 1-indexed segment ID as found in the original test set data release (sgm files)
- **final document ID**: document ID used in the Appraise output (csv files in this directory); note that Hansard is split into pseudo-documents
- **final segment ID**: 0-indexed segment ID; matches segment IDs in Appraise output csv files
- **metrics segment ID**: 0-indexed segment ID used for the metrics task (segment ID according to segment order in original sgm files)
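
A sketch for loading the mapping, keyed by (domain, final document ID, final segment ID):

```python
import csv

# Assumes the tsv has no header row; skip the first row if it does.
mapping = {}
with open("segment_id_mapping.tsv", newline="", encoding="utf-8") as f:
    for domain, orig_doc, orig_seg, final_doc, final_seg, metrics_seg in csv.reader(f, delimiter="\t"):
        mapping[(domain, final_doc, int(final_seg))] = int(metrics_seg)

# Example lookup (key values here are hypothetical): the metrics-task
# segment ID for a segment from the Appraise output csv files.
# metrics_id = mapping[("news", some_docid, some_segid)]
```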


## Document-level scores and itemtype

All csv files contain both segment-level and document-level scores, with the latter indicated by a value of True in the docscore column. Both are collected in the same interface: segment-level scores are produced within document context, and the document-level score is entered after all of a document's segment-level scores are complete.

We use only segment-level scores (docscore: False) in our work, but we provide all the data here for completeness.

The Hansard data contains no quality assurance/quality control items, so all segment-level and document-level scores are labeled TGT.

The News data does contain QA/QC items. At the segment level, these are labeled with itemtype BAD. No document-level scores are labeled BAD. However, an examination of the document-level scores suggests that this may be a mislabeling: some document-level scores may correspond to documents containing BAD references (corrupted MT output whose scores, compared against those for the uncorrupted output, were used to remove annotations through QA/QC processes), and thus should be labeled BAD. We do not have access to the raw BAD reference text data to confirm this. We strongly urge caution in using any of the document-level News scores.
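
In code, the conservative filter described above (keep only segment-level scores that passed QA) might look like the following sketch; the filename is a placeholder:

```python
import csv

def usable_rows(path):
    """Yield segment-level rows with valid scores: docscore False, itemtype TGT."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # assumes a header row with the columns above
            if row["docscore"] == "False" and row["itemtype"] == "TGT":
                yield row

# "news.csv" is a placeholder filename.
scores = [float(row["score"]) for row in usable_rows("news.csv")]
```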