
Create a script to perform the switch-out augmentation technique #5

Open · Iambusayor opened this issue Jan 13, 2024 · 9 comments

@Iambusayor (Collaborator) commented Jan 13, 2024

Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT explains the switch-out method. The technique frames data augmentation (DA) as an optimization problem: randomly replace words in both the source and target sentences with other words drawn at random from their corresponding vocabularies. See the original paper, SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation, for more information. A crude implementation is available at https://github.com/MaximeNe/SwitchOut
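For orientation, the SwitchOut paper's appendix gives PyTorch pseudocode for this sampling step. Below is a rough, modernized sketch of that routine; the signature mirrors the hamming_distance_sample function in the linked repo, but treat the details (e.g. how tau scales the distribution) as assumptions to verify against the paper:

import torch

def hamming_distance_sample(sents, tau, bos_id, eos_id, pad_id, vocab_size):
    # sents: LongTensor [batch_size, n_steps] of token ids. Returns a tensor
    # of the same shape with some tokens replaced by random vocabulary items.

    # Never replace special tokens (BOS/EOS/PAD).
    special = (sents == bos_id) | (sents == eos_id) | (sents == pad_id)
    lengths = (~special).float().sum(dim=1)
    batch_size, n_steps = sents.size()

    # Sample how many tokens to corrupt in each sentence, with probability
    # decaying in the Hamming distance and tau acting as the temperature.
    logits = -torch.arange(n_steps, dtype=torch.float)
    logits = logits.unsqueeze(0).expand(batch_size, n_steps).contiguous()
    logits = logits.masked_fill(special, float("-inf"))
    probs = torch.softmax(logits * tau, dim=1)
    num_words = torch.distributions.Categorical(probs).sample().float()

    # Choose the positions to corrupt at rate num_words / length.
    rate = (num_words / lengths).unsqueeze(1).expand(batch_size, n_steps)
    rate = rate.contiguous().masked_fill(special, 0.0).clamp_(0.0, 1.0)
    corrupt_pos = torch.bernoulli(rate).bool()

    # Shift each corrupted position by a random non-zero offset mod vocab_size,
    # which yields a uniformly random *different* token id.
    offsets = torch.zeros_like(sents)
    offsets[corrupt_pos] = torch.randint(1, vocab_size, (int(corrupt_pos.sum()),))
    return (sents + offsets) % vocab_size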

@Iambusayor changed the title from "Create a script that performs switch-out" to "Create a script to perform the switch-out augmentation technique" on Jan 13, 2024
@r-chinonyelum commented:

I'd like to be assigned this

@r-chinonyelum commented:

Can I work on this task with someone, please?

@r-chinonyelum commented:

There's a sketch of the script in the original paper. There's also the crude implementation linked above. How can I adapt these into what we need?

@r-chinonyelum commented:

I'm running into a data-type mismatch issue.

@Iambusayor (Collaborator, Author) commented:

Could you post the portion of your code that produced the error, along with the traceback? Or have you solved it already?

@Iambusayor (Collaborator, Author) commented:

> There's a sketch of the script in the original paper. There's also the crude implementation linked above. How can I adapt these into what we need?

Understand what the paper explains, then see whether the crude implementation suffices or whether you have to modify it or write your own.

@owos (Owner) commented Jan 18, 2024

Hi @lumnolar, so I took a look at this and here's how you could work with the switchout script.

from datasets import load_dataset
from transformers import AutoTokenizer
from switchout import hamming_distance_sample

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
tau = 0.2  # temperature controlling how aggressively tokens get replaced
bos_id = tokenizer.convert_tokens_to_ids(tokenizer.bos_token)
eos_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
pad_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
vocab_size = tokenizer.vocab_size
padding = "max_length"

# Quick sanity check on a single sentence.
model_inputs1 = tokenizer(
    "I am a boy", padding=padding, truncation=True, return_tensors="pt"
)

# Corrupt the token ids, then decode to see what switchout produced.
model_inputs2 = hamming_distance_sample(
    model_inputs1["input_ids"], tau, bos_id, eos_id, pad_id, vocab_size
)
model_inputs2_to_text = tokenizer.batch_decode(model_inputs2, skip_special_tokens=True)
print(model_inputs2_to_text)



# Working with our dataset:
data = load_dataset("masakhane/mafand", "en-yor")


def apply_switchout(examples):
    # Tokenize both sides of each pair, then corrupt each row with switchout
    # (one row at a time, since hamming_distance_sample expects a 2-D tensor).
    source_lang = "en"  # this should not be hard-coded
    target_lang = "yor"  # this should not be hard-coded

    inputs = [ex[source_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, padding=padding, truncation=True, return_tensors="pt"
    )
    model_inputs["input_ids"] = [
        hamming_distance_sample(
            inp.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for inp in model_inputs["input_ids"]
    ]
    targets = [ex[target_lang] for ex in examples["translation"]]
    labels = tokenizer(targets, padding=padding, truncation=True, return_tensors="pt")
    labels["input_ids"] = [
        hamming_distance_sample(
            trgt.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for trgt in labels["input_ids"]
    ]
    # What you do from here depends on the model you are working with; this
    # example uses mbart-50. If you pad to max_length and want the loss to
    # ignore padding, replace pad token ids in the labels with -100:
    ignore_pad_token_for_loss = False  # set True to mask padding in the loss
    if padding == "max_length" and ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(int(l) if int(l) != tokenizer.pad_token_id else -100) for l in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_data = data.map(apply_switchout, batched=True)
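To see the effect, you can decode one augmented example next to its original (this assumes the mafand schema used above):

print(data["train"][0]["translation"]["en"])
print(tokenizer.decode(tokenized_data["train"][0]["input_ids"], skip_special_tokens=True))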

Please play with the code to make sure you understand it.

Now, the next step would be to implement two types of switchout:

  1. In-language switchout: do replacement using tokens drawn from the language itself
  2. Random switchout: do replacement using random tokens from the full vocabulary

How to implement 1 (a rough sketch follows after these steps):

  1. Change the last parameter of switchout, i.e. vocab_size, to be a list rather than an int.
  2. Wherever vocab_size is used, make the implementation sample from that list and return the token id it picks.
  3. To build the list that gets passed into the hamming_distance_sample function, take the train split, tokenize it, convert the result to a list, and remove duplicates.
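A minimal sketch of those three steps, assuming the hamming_distance_sample variant sketched earlier in the thread (the helper names build_in_language_ids and in_language_switchout are illustrative, not from the repo):

import torch

def build_in_language_ids(dataset, tokenizer, lang):
    # Step 3: tokenize the train split, collect the unique token ids, and drop
    # special tokens so they are never sampled as replacements.
    ids = set()
    for ex in dataset["train"]:
        ids.update(tokenizer(ex["translation"][lang])["input_ids"])
    ids -= set(tokenizer.all_special_ids)
    return sorted(ids)

def in_language_switchout(sents, tau, bos_id, eos_id, pad_id, vocab_ids):
    # Steps 1-2: the position sampling is unchanged; only the replacement
    # values now come from vocab_ids (a list) instead of range(vocab_size).
    special = (sents == bos_id) | (sents == eos_id) | (sents == pad_id)
    lengths = (~special).float().sum(dim=1)
    batch_size, n_steps = sents.size()

    logits = -torch.arange(n_steps, dtype=torch.float)
    logits = logits.unsqueeze(0).expand(batch_size, n_steps).contiguous()
    logits = logits.masked_fill(special, float("-inf"))
    probs = torch.softmax(logits * tau, dim=1)
    num_words = torch.distributions.Categorical(probs).sample().float()

    rate = (num_words / lengths).unsqueeze(1).expand(batch_size, n_steps)
    rate = rate.contiguous().masked_fill(special, 0.0).clamp_(0.0, 1.0)
    corrupt_pos = torch.bernoulli(rate).bool()

    # Sample replacement ids directly from the in-language pool.
    pool = torch.tensor(vocab_ids, dtype=torch.long)
    picks = pool[torch.randint(0, len(vocab_ids), (int(corrupt_pos.sum()),))]
    out = sents.clone()
    out[corrupt_pos] = picks
    return out

For example, vocab_ids = build_in_language_ids(data, tokenizer, "yor") would then be passed where vocab_size used to go.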

Please respond with questions for any section that you do not understand.

We have also decided that you will be the one to carry out the ablation studies needed to find the best tau value, so please start early.
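If it helps, a simple starting point for that ablation is a sweep over candidate tau values, re-augmenting and re-scoring for each one; the grid below is illustrative, and the training/evaluation steps are placeholders:

results = {}
for t in (0.1, 0.2, 0.3, 0.5, 0.8, 1.0):  # illustrative grid, not from the paper
    tau = t  # apply_switchout reads the module-level tau defined earlier
    tokenized = data.map(apply_switchout, batched=True, load_from_cache_file=False)
    # ... train a model on `tokenized` and evaluate BLEU on the dev split ...
    # results[t] = dev_bleu
print(results)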

@r-chinonyelum commented:

Alright. Thank you very much. I'm on it.

@Onoyiza commented Jan 20, 2024

Hi @lumnolar, can I work with you on this? Do you still need someone on this task?
