Beyond English: Assessing Memorization of Translated Texts in Large Language Models

Overview

This repository contains the research conducted for the paper "Beyond English: Assessing Memorization of Translated Texts in Large Language Models," carried out as part of the ERSP 2023-24 program. The goal of this research is to assess LLM memorization of books in English, Spanish, Vietnamese, and Turkish.

🚀 Hypotheses & Research Questions

1. Translation Memorization

  • Hypothesis: Large language models (LLMs) memorize the content of translated books.
  • Follow-up Question: Do the models perform better in English even when the original work is in Turkish, Spanish, or Vietnamese?

2. Cross-Lingual Memorization

  • Hypothesis: LLMs can transfer their memorization across languages.
  • Follow-up Question: Can LLMs memorize translations into languages not present in their pre-training dataset, and will their performance remain strong for out-of-distribution languages?

👩🏻‍💻 Contributors

Alisha Srivastava Nhat Minh Le Emir Korukluogu

Special Thanks 🌟

  • Chau Minh Pham - For being our research mentor and guiding our research.
  • Dr. Marzena Karpinska - For guiding our research and for her invaluable expertise.
  • Dr. Mohit Iyyer - For being our research advisor and guiding our research.

🏗️ Dataset Construction

Collect 25 books in English, Turkish, Vietnamese, and Spanish from Project Gutenberg and other online sources.

Process the books into passages longer than 39 tokens that contain exactly one named entity:

  • Extract excerpts in each language, ensuring they consist of full sentences and contain a single named entity.
  • Clean the text of metadata and align excerpts across the four languages.
  • Retain only excerpts that pass the length check, contain exactly one named entity, and survive alignment verification (see the filtering sketch after this list).
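To make the filtering criterion concrete, here is a minimal sketch of the passage filter, assuming spaCy's multilingual NER model and the tiktoken tokenizer; the tokenizer, NER pipeline, and thresholds used in the paper may differ.

```python
import spacy
import tiktoken

# Assumptions: spaCy's multilingual NER model and an OpenAI-style tokenizer.
nlp = spacy.load("xx_ent_wiki_sm")
enc = tiktoken.get_encoding("cl100k_base")

def keep_passage(text: str, min_tokens: int = 40) -> bool:
    """Keep a passage only if it is longer than 39 tokens
    and contains exactly one named-entity span."""
    if len(enc.encode(text)) < min_tokens:
        return False
    return len(nlp(text).ents) == 1

# Usage: filter candidate excerpts extracted from one book
candidates = ["Excerpt one ...", "Excerpt two ..."]
passages = [p for p in candidates if keep_passage(p)]
```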

Experiments

1. Setup

  • Models Used:
    • OpenAI API for GPT-4o
    • Claude API for Claude
    • Unity + vLLM for LLaMA models
    • Fireworks API for LLaMA-3.1-405B-instruct
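As an illustration of how the API-hosted models can be queried, below is a hedged sketch using the OpenAI Python client for GPT-4o; the prompt wording and decoding parameters are placeholders, not the settings used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4o(prompt: str) -> str:
    """Send a single prompt to GPT-4o and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic-leaning decoding for reproducibility
    )
    return response.choices[0].message.content

print(query_gpt4o("Which book does this passage come from? <excerpt>"))
```

The Claude and Fireworks endpoints follow a similar chat-style request pattern with their respective clients.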

2. Experiment Types

  • Experiment 0: Direct Probing
    • Assess accuracy based on exact and fuzzy matches between the model's answer and the reference (see the scoring sketch after this list).
  • Experiment 1: Name Cloze Task
    • Provide input excerpts with the named entity masked and evaluate exact matches on the predicted name.
  • Experiment 2: Prefix Probing / Continuation Generation
    • Prompt models to continue provided sentences and evaluate the generated continuations against the reference text.
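The scoring behind "exact and fuzzy matches" could look like the following sketch; it uses Python's standard-library difflib for the fuzzy similarity, and the 0.8 threshold is an assumption rather than the paper's reported setting.

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after light normalization."""
    return prediction.strip().lower() == reference.strip().lower()

def fuzzy_match(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Fuzzy match via character-level similarity ratio.
    The 0.8 threshold is illustrative, not the paper's setting."""
    ratio = SequenceMatcher(None, prediction.lower(), reference.lower()).ratio()
    return ratio >= threshold

# Example: scoring a name-cloze prediction against the masked entity
print(exact_match("Anna Karenina", "Anna Karenina"))  # True
print(fuzzy_match("Ana Karenina", "Anna Karenina"))   # True (small edit distance)
```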

Analyses

  • Model capacity and quantization effects.
  • Prevalence of quotes and named entities in the excerpts.
  • Effect of prefix token count on model performance (see the sketch after this list).
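One way to carry out the prefix-length analysis is sketched below: bucket results by prefix token count and report mean accuracy per bucket. The tokenizer, bucket size, and the (prefix, is_correct) result format are assumptions for illustration.

```python
from collections import defaultdict

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer for counting

def accuracy_by_prefix_length(results, bucket_size=10):
    """Group (prefix_text, is_correct) pairs into token-count buckets
    and report mean accuracy per bucket."""
    buckets = defaultdict(list)
    for prefix, correct in results:
        n_tokens = len(enc.encode(prefix))
        bucket_start = n_tokens // bucket_size * bucket_size
        buckets[bucket_start].append(correct)
    return {
        start: sum(flags) / len(flags)
        for start, flags in sorted(buckets.items())
    }

# Usage with hypothetical results
results = [("Call me Ishmael.", True), ("It was the best of times ...", False)]
print(accuracy_by_prefix_length(results))
```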

Contact

For any inquiries or discussions related to this research, please contact Alisha Srivastava at [email protected].
