
An Exploration of Models That Differentiate Translationese from Original English Texts

In this project, we build machine learning models that differentiate between translated and non-translated English texts using previously hypothesized characteristics of translated text, collectively termed “translationese.” We built seven models, each a different combination of classifier and feature set, that show promise in distinguishing between these two types of texts. Our best model, a simple neural network trained on TF-IDF features, reaches 86.7% accuracy, 82.4% precision, and 93.3% recall on our test set. The success of these models supports the hypothesis that translated texts carry distinct features, and suggests that these features are pronounced enough for simple machine learning models to pick up on.

Dependencies

  • scikit-learn 0.24.0
  • pandas 1.2.1
  • nltk 3.5
  • matplotlib 3.3.3
  • numpy 1.19.2
  • spacy 2.3.5
    • en-core-web-sm 2.3.1
  • keras 2.4.3
  • tensorflow 2.4.1
  • seaborn 0.11.1 (for displaying confusion matrices if desired)

Usage

You can interact with our implementation through a Jupyter notebook, as demonstrated in classify_texts.ipynb.

Import the relevant modules

import pandas as pd
import numpy as np
from classify_trans import prepare_new, classify_new, get_validation_set
from models_util import plot_confusion_matrix

get_validation_set() returns the feature matrix (X_valid) and the labels (y_valid) of the validation set we used to develop our models.

X_valid, y_valid = get_validation_set()

classify_new() applies all models to the given feature matrix, compares them against the correct labels, and returns two dataframes:

  • models shows the accuracy, precision, and recall of each model when applied to the dataset
  • texts shows the output of each model on each item in the dataset

models, texts = classify_new(X_valid, y_valid)

For example, models might look like the following (these are statistics on our validation set):

   model               accuracy  precision  recall
0  NB, custom + BOW    0.869565  0.806452   1.00
1  NB, TF-IDF          0.782609  0.714286   1.00
2  LR, custom + BOW    0.804348  0.833333   0.80
3  SVM, custom + BOW   0.804348  0.807692   0.84
4  Neural Net, TF-IDF  0.913043  0.888889   0.96
5  spaCy, BOW (LR)     0.652174  0.645161   0.80
6  spaCy, TF-IDF (LR)  0.782609  0.727273   0.96
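
If you want to look at a single model's errors rather than summary statistics, the imported plot_confusion_matrix helper (or seaborn, listed above as an optional dependency) can be used. Since plot_confusion_matrix's exact signature is not documented here, the following is only a minimal sketch that computes and plots a confusion matrix directly with scikit-learn and seaborn, assuming the per-model prediction columns of texts are named as in the models dataframe:

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: texts has one prediction column per model, named as in the
# "model" column of the models dataframe shown above.
preds = texts['Neural Net, TF-IDF'].astype(int)  # cast booleans/ints to 0/1

cm = confusion_matrix(y_valid, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['original', 'translation'],
            yticklabels=['original', 'translation'])
plt.xlabel('predicted label')
plt.ylabel('actual label')
plt.show()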

Classify new texts

Please save all text files in a single directory. The files must be named 1, 2, 3, ... in consecutive order, starting from 1 and going up to however many texts there are (e.g., no jumping from file 5 to 7).

prepare_new() takes the path to the directory containing the texts, an integer giving the number of files in that directory, and a list of correct labels (1 if a translation, 0 otherwise) in corresponding order. By default, prepare_new() skips the first two lines of each file, because our texts are formatted so that each file starts with a line containing the source language (if a translation) and the URL of the article, followed by a blank line, followed by the actual text. Set the optional parameter start=True if your files begin directly with the text and nothing should be skipped.
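
For concreteness, a directory in the default layout could be prepared like this (purely illustrative; the directory name, metadata line, and URL are placeholders, not files from our corpus):

import os

# Create a hypothetical directory "my_texts" with one file named "1" in the
# default format: a metadata line (source language, if a translation, plus
# the article URL), a blank line, then the article text.
os.makedirs("my_texts", exist_ok=True)
with open(os.path.join("my_texts", "1"), "w", encoding="utf-8") as f:
    f.write("https://example.com/article\n")  # placeholder metadata line
    f.write("\n")
    f.write("This is the body of an original English article...\n")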

test_y = [0]
df, test_v = prepare_new("test", 1, test_y)  # prepare_new("test", 1, test_y, start=True)

df is a dataframe that contains the following information for each text:

  • text content (df['text'])
  • average sentence length (in words) (df['avg_sent'])
  • average word length (in characters) (df['avg_word'])
  • percentage of stopwords in text (df['stopwords'])
  • correct label (0 or 1) (df['label'])

test_v is a feature matrix in a format compatible with our models.
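
As a quick sanity check, the hand-crafted features can be inspected directly from df (column names as listed above):

# Peek at the surface features extracted for each text
print(df[['avg_sent', 'avg_word', 'stopwords', 'label']])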

Classify the texts with classify_new().

test_models, test_texts = classify_new(test_v, test_y)

Below is test_texts for a dataset containing a single article, saved as file 1 in the directory test.

   actual_label  NB, custom + BOW  NB, TF-IDF  LR, custom + BOW  SVM, custom + BOW  Neural Net, TF-IDF  spaCy, BOW (LR)  spaCy, TF-IDF (LR)  file    text
0  english       0                 0           0                 0                  False               0                0                   test_1  This article fell in our laps completely by su...

Code

  • classify_trans.py: trains models on our dataset and contains functions to prepare new texts for classification
  • models_util.py: helper functions used in classify_trans.py
  • classify_texts.ipynb: example of usage
  • models.ipynb: notebook for experimenting with models (not very relevant)
  • text_classification_spaCy.ipynb: notebook for experimenting with spaCy (not very relevant)

Directories

  • english: 83 original English-language articles making up our train & validation sets
  • translation: 100 translated English articles making up our train & validation sets
  • test: 30 original English and translated articles making up our test set



Pearl Hwang, Cynthia Lin
Yale University
LING 227: Language & Computation
May 2021
