In this project, we seek to build machine learning models that can differentiate between translated and non-translated English texts using previously hypothesized characteristics of translated texts, collectively termed “translationese.” We built seven models, each a combination of a classifier and a feature set, that show promise in distinguishing between these two types of text. Our best model, a simple neural network trained on TF-IDF features, reaches 86.7% accuracy, 82.4% precision, and 93.3% recall on our test set. Our success in building these models supports the hypothesis that translated texts carry distinct features, and shows that these features are pronounced enough for simple machine learning models to separate translated from non-translated texts.
- scikit-learn 0.24.0
- pandas 1.2.1
- nltk 3.5
- matplotlib 3.3.3
- numpy 1.19.2
- spacy 2.3.5
- en-core-web-sm 2.3.1
- keras 2.4.3
- tensorflow 2.4.1
- seaborn 0.11.1 (for displaying confusion matrices, if desired)
You can interact with our implementation through a Jupyter notebook, as in `classify_texts.ipynb`.
```python
import pandas as pd
import numpy as np
from classify_trans import prepare_new, classify_new, get_validation_set
from models_util import plot_confusion_matrix
```
`get_validation_set()` returns the feature matrix (`X_valid`) and the labels (`y_valid`) of the validation set we used to develop our models.

```python
X_valid, y_valid = get_validation_set()
```
`classify_new()` applies all models to the given feature matrix, compares them against the correct labels, and returns two dataframes:

- `models` shows the accuracy, precision, and recall of each model when applied to the dataset
- `texts` shows the output of each model on each item in the dataset

```python
models, texts = classify_new(X_valid, y_valid)
```
For example, `models` might look like the following (these are statistics on our validation set):
| | model | accuracy | precision | recall |
|---|---|---|---|---|
| 0 | NB, custom + BOW | 0.869565 | 0.806452 | 1.00 |
| 1 | NB, TF-IDF | 0.782609 | 0.714286 | 1.00 |
| 2 | LR, custom + BOW | 0.804348 | 0.833333 | 0.80 |
| 3 | SVM, custom + BOW | 0.804348 | 0.807692 | 0.84 |
| 4 | Neural Net, TF-IDF | 0.913043 | 0.888889 | 0.96 |
| 5 | spaCy, BOW (LR) | 0.652174 | 0.645161 | 0.80 |
| 6 | spaCy, TF-IDF (LR) | 0.782609 | 0.727273 | 0.96 |
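The accuracy, precision, and recall columns are the standard binary-classification metrics, with translated (label `1`) as the positive class. A minimal sketch of how they are computed from a model's predictions (the toy labels below are illustrative, not our data):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 (translated) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy example: 4 texts, one false positive
acc, prec, rec = binary_metrics([1, 0, 1, 0], [1, 1, 1, 0])
# acc = 0.75, prec = 2/3, rec = 1.0
```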
Please save all text files in a single directory. Note that the texts must be named `1`, `2`, ..., `99`, `100`, `101`, ... That is, the file names must be numbered consecutively from 1 up to however many texts there are (e.g., no jumping from file `5` to `7`).
`prepare_new()` takes in the path to the directory in which the texts are stored, an integer indicating the number of files in the directory, and a list of correct labels (`1` if a translation, `0` otherwise) in corresponding order. By default, `prepare_new()` skips the first two lines of each file, because we formatted our texts such that each file starts with a line containing the source language (if a translation) and the URL of the article, followed by a blank line, followed by the actual text. Set the optional parameter `start=True` to keep the first two lines instead of skipping them.
```python
test_y = [0]
df, test_v = prepare_new("test", 1, test_y)  # or prepare_new("test", 1, test_y, start=True)
```
`df` is a dataframe that contains the following information for each text:

- text content (`df['text']`)
- average sentence length in words (`df['avg_sent']`)
- average word length in characters (`df['avg_word']`)
- percentage of stopwords in the text (`df['stopwords']`)
- correct label, `0` or `1` (`df['label']`)
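The three surface features could be computed roughly as follows. This is a simplified sketch with naive tokenization and a tiny hand-rolled stopword set; the actual implementation in `classify_trans.py` uses NLTK and may differ in detail:

```python
# Simplified stand-ins; the repo relies on NLTK's tokenizers and stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "it", "is", "was"}

def text_features(text):
    """Average sentence length (words), average word length (chars),
    and stopword percentage, mirroring avg_sent / avg_word / stopwords."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    words = [w for w in words if w]
    avg_sent = len(words) / len(sentences)
    avg_word = sum(len(w) for w in words) / len(words)
    stop_pct = 100 * sum(w in STOPWORDS for w in words) / len(words)
    return avg_sent, avg_word, stop_pct
```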
`test_v` is a feature matrix in a format compatible with our models. Classify the texts with `classify_new()`:

```python
test_models, test_texts = classify_new(test_v, test_y)
```
Below is `test_texts` for a dataset containing a single article, with file name `1`, in the directory `test`.
| | actual_label | NB, custom + BOW | NB, TF-IDF | LR, custom + BOW | SVM, custom + BOW | Neural Net, TF-IDF | spaCy, BOW (LR) | spaCy, TF-IDF (LR) | file | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | english | 0 | 0 | 0 | 0 | False | 0 | 0 | test_1 | This article fell in our laps completely by su... |
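A per-text dataframe like this makes it easy to pull out misclassifications with ordinary pandas filtering. A hypothetical example with integer labels (the `actual_label` column may instead hold a category name, as above, in which case it would need to be mapped to `0`/`1` first):

```python
import pandas as pd

# Hypothetical texts-style dataframe: true labels plus one model's predictions
texts = pd.DataFrame({
    "actual_label": [0, 1, 1],
    "NB, TF-IDF": [0, 0, 1],
    "file": ["test_1", "test_2", "test_3"],
})

# Rows where the model's prediction disagrees with the true label
misses = texts[texts["NB, TF-IDF"] != texts["actual_label"]]
```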
- `classify_trans.py`: trains models on our dataset and contains functions to prepare new texts for classification
- `models_util.py`: helper functions used in `classify_trans.py`
- `classify_texts.ipynb`: example of usage
- `models.ipynb`: notebook for experimenting with models (not very relevant)
- `text_classification_spaCy.ipynb`: notebook for experimenting with spaCy (not very relevant)
- `english`: 83 original English-language articles making up our train & validation sets
- `translation`: 100 translated English articles making up our train & validation sets
- `test`: 30 original English and translated articles making up our test set
Pearl Hwang, Cynthia Lin
Yale University
LING 227: Language & Computation
May 2021