Semi-Automatic NLI Data Collection

This repository contains the data and code accompanying the paper "Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options".

Datasets

The five datasets described in the paper are available under the data/ directory: base_news, base_wiki, sim_news, sim_wiki, and translate_wiki. Each dataset comes with a training set and a test set, both in .jsonl format. Please refer to the paper for the statistics of each dataset.
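Since the splits are stored as .jsonl (one JSON object per line), they can be read with the Python standard library alone. The following is a minimal sketch, not part of the released code: the helper name read_jsonl and the example path in the comment are assumptions for illustration, and the field names inside each record are not specified here, so inspect a loaded example before relying on particular keys.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a .jsonl file into a list of dicts, one per non-empty line."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples


# Hypothetical usage; the actual file names under data/ may differ:
# train = read_jsonl("data/base_news/train.jsonl")
# print(len(train), sorted(train[0].keys()))
```

Printing the keys of the first record is a quick way to check the schema of a given dataset before writing any processing code.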

License

We use premises taken from the English Gigaword Fifth Edition, English Wikipedia and Simple Wikipedia (downloaded May 2020), and WikiMatrix. The English Gigaword is distributed under the LDC User Agreement. Wikipedia is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Experiments

The code used for the experiments in the paper can be found under the scripts directory. Please follow the README in each sub-directory for more details. For experiments using jiant (we use v1.2), please follow the jiant documentation for installation and usage instructions.

Citation

@inproceedings{vania2020asking,
    title = "{Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options}",
    author = "Vania, Clara  and
      Chen, Ruijie  and
      Bowman, Samuel R.",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
