In this project, we seek to build machine learning models that can differentiate between translated and non-translated English texts using previously hypothesized characteristics of translated texts, collectively termed “translationese.” We built seven models, each a combination of a classifier and a feature set, that show promise in distinguishing between these two types of text. Our best model, a simple neural network trained on TF-IDF features, reaches 86.7% accuracy, 82.4% precision, and 93.3% recall on our test set. Our success in building these models supports the hypothesis that translated texts carry distinct features, and shows that these features are pronounced enough for simple machine learning models to separate translated from non-translated texts.
- scikit-learn 0.24.0
- pandas 1.2.1
- nltk 3.5
- matplotlib 3.3.3
- numpy 1.19.2
- spacy 2.3.5
- en-core-web-sm 2.3.1
- keras 2.4.3
- tensorflow 2.4.1
- seaborn 0.11.1 (for displaying confusion matrices, if desired)
You can interact with our implementation through a Jupyter notebook, as in `classify_texts.ipynb`.
```python
import pandas as pd
import numpy as np
from classify_trans import prepare_new, classify_new, get_validation_set
from models_util import plot_confusion_matrix
```
`get_validation_set()` returns the feature matrix (`X_valid`) and the labels (`y_valid`) of the validation set we used to develop our models.

```python
X_valid, y_valid = get_validation_set()
```
`classify_new()` applies all models to the given feature matrix, compares them against the correct labels, and returns two dataframes:

- `models` shows the accuracy, precision, and recall of each model when applied to the dataset
- `texts` shows the output of each model on each item in the dataset

```python
models, texts = classify_new(X_valid, y_valid)
```
For example, `models` might look like the following (these are statistics on our validation set):
| | model | accuracy | precision | recall |
|---|---|---|---|---|
| 0 | NB, custom + BOW | 0.869565 | 0.806452 | 1.00 |
| 1 | NB, TF-IDF | 0.782609 | 0.714286 | 1.00 |
| 2 | LR, custom + BOW | 0.804348 | 0.833333 | 0.80 |
| 3 | SVM, custom + BOW | 0.804348 | 0.807692 | 0.84 |
| 4 | Neural Net, TF-IDF | 0.913043 | 0.888889 | 0.96 |
| 5 | spaCy, BOW (LR) | 0.652174 | 0.645161 | 0.80 |
| 6 | spaCy, TF-IDF (LR) | 0.782609 | 0.727273 | 0.96 |
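The accuracy, precision, and recall columns are the standard binary-classification metrics, with translated (label `1`) as the positive class. A minimal sketch of how they are computed from a model's predictions (the toy labels below are illustrative, not our data):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 (translated) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy example: 4 texts, one false positive
acc, prec, rec = binary_metrics([1, 0, 1, 0], [1, 1, 1, 0])
# acc = 0.75, prec = 2/3, rec = 1.0
```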
Please save all text files in a single directory. Note that the texts must be named `1`, `2`, ..., `99`, `100`, `101`, ... That is, the file names must be numbered consecutively from 1 up to however many texts there are (e.g., no jumping from file `5` to `7`).
`prepare_new()` takes in the path to the directory in which the texts are stored, an integer indicating the number of files in the directory, and a list of correct labels (`1` if a translation, `0` otherwise) in corresponding order. By default, `prepare_new()` skips the first two lines of each file, because we formatted our texts such that each file starts with a line containing the source language (if a translation) and the URL of the article, followed by a blank line, followed by the actual text. Set the optional parameter `start=True` to keep the first two lines instead of skipping them.
```python
test_y = [0]
df, test_v = prepare_new("test", 1, test_y)  # or prepare_new("test", 1, test_y, start=True)
```
`df` is a dataframe that contains the following information for each text:

- text content (`df['text']`)
- average sentence length in words (`df['avg_sent']`)
- average word length in characters (`df['avg_word']`)
- percentage of stopwords in the text (`df['stopwords']`)
- correct label, `0` or `1` (`df['label']`)
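The three surface features could be computed roughly as follows. This is a simplified sketch with naive tokenization and a tiny hand-rolled stopword set; the actual implementation in `classify_trans.py` uses NLTK and may differ in detail:

```python
# Simplified stand-ins; the repo relies on NLTK's tokenizers and stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "it", "is", "was"}

def text_features(text):
    """Average sentence length (words), average word length (chars),
    and stopword percentage, mirroring avg_sent / avg_word / stopwords."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    words = [w for w in words if w]
    avg_sent = len(words) / len(sentences)
    avg_word = sum(len(w) for w in words) / len(words)
    stop_pct = 100 * sum(w in STOPWORDS for w in words) / len(words)
    return avg_sent, avg_word, stop_pct
```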
`test_v` is a feature matrix in a format compatible with our models. Classify the texts with `classify_new()`:

```python
test_models, test_texts = classify_new(test_v, test_y)
```
Below is `test_texts` for a dataset containing a single article, with file name `1`, in the directory `test`.
| | actual_label | NB, custom + BOW | NB, TF-IDF | LR, custom + BOW | SVM, custom + BOW | Neural Net, TF-IDF | spaCy, BOW (LR) | spaCy, TF-IDF (LR) | file | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | english | 0 | 0 | 0 | 0 | False | 0 | 0 | test_1 | This article fell in our laps completely by su... |
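A per-text dataframe like this makes it easy to pull out misclassifications with ordinary pandas filtering. A hypothetical example with integer labels (the `actual_label` column may instead hold a category name, as above, in which case it would need to be mapped to `0`/`1` first):

```python
import pandas as pd

# Hypothetical texts-style dataframe: true labels plus one model's predictions
texts = pd.DataFrame({
    "actual_label": [0, 1, 1],
    "NB, TF-IDF": [0, 0, 1],
    "file": ["test_1", "test_2", "test_3"],
})

# Rows where the model's prediction disagrees with the true label
misses = texts[texts["NB, TF-IDF"] != texts["actual_label"]]
```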
- `classify_trans.py`: trains models on our dataset and contains functions to prepare new texts for classification
- `models_util.py`: helper functions used in `classify_trans.py`
- `classify_texts.ipynb`: example of usage
- `models.ipynb`: notebook for experimenting with models (not very relevant)
- `text_classification_spaCy.ipynb`: notebook for experimenting with spaCy (not very relevant)
- `english`: 83 original English-language articles making up our train & validation sets
- `translation`: 100 translated English articles making up our train & validation sets
- `test`: 30 original English and translated articles making up our test set
Pearl Hwang, Cynthia Lin
Yale University
LING 227: Language & Computation
May 2021