
Create a script to perform the switch-out augmentation technique #5

Open · Iambusayor opened this issue Jan 13, 2024 · 9 comments

@Iambusayor (Collaborator) commented Jan 13, 2024

Measuring the Impact of Data Augmentation Methods for Extremely Low-Resource NMT explains the switch-out method. The technique frames data augmentation (DA) as an optimization problem: randomly replace words in both the source and target sentences with other words drawn at random from their corresponding vocabularies. See the original paper, SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation, for more information. A crude implementation is available at https://github.com/MaximeNe/SwitchOut
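For orientation, the SwitchOut paper's appendix gives PyTorch pseudocode for this sampling step. Below is a rough, modernized sketch of that routine; the signature mirrors the hamming_distance_sample function in the linked repo, but treat the details (e.g. how tau scales the distribution) as assumptions to verify against the paper:

import torch

def hamming_distance_sample(sents, tau, bos_id, eos_id, pad_id, vocab_size):
    # sents: LongTensor [batch_size, n_steps] of token ids. Returns a tensor
    # of the same shape with some tokens replaced by random vocabulary items.

    # Never replace special tokens (BOS/EOS/PAD).
    special = (sents == bos_id) | (sents == eos_id) | (sents == pad_id)
    lengths = (~special).float().sum(dim=1)
    batch_size, n_steps = sents.size()

    # Sample how many tokens to corrupt in each sentence, with probability
    # decaying in the Hamming distance and tau acting as the temperature.
    logits = -torch.arange(n_steps, dtype=torch.float)
    logits = logits.unsqueeze(0).expand(batch_size, n_steps).contiguous()
    logits = logits.masked_fill(special, float("-inf"))
    probs = torch.softmax(logits * tau, dim=1)
    num_words = torch.distributions.Categorical(probs).sample().float()

    # Choose the positions to corrupt at rate num_words / length.
    rate = (num_words / lengths).unsqueeze(1).expand(batch_size, n_steps)
    rate = rate.contiguous().masked_fill(special, 0.0).clamp_(0.0, 1.0)
    corrupt_pos = torch.bernoulli(rate).bool()

    # Shift each corrupted position by a random non-zero offset mod vocab_size,
    # which yields a uniformly random *different* token id.
    offsets = torch.zeros_like(sents)
    offsets[corrupt_pos] = torch.randint(1, vocab_size, (int(corrupt_pos.sum()),))
    return (sents + offsets) % vocab_size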

@Iambusayor changed the title from "Create a script that performs switch-out" to "Create a script to perform the switch-out augmentation technique" on Jan 13, 2024
@r-chinonyelum commented:

I'd like to be assigned this

@r-chinonyelum commented:

Can I work on this task with someone, please?

@r-chinonyelum commented:

There's a sketch of the script in the original paper. There's also the crude implementation linked above. How can I adapt these into what we need?

@r-chinonyelum commented:

I'm running into a data-type mismatch issue.

@Iambusayor (Collaborator, Author) commented:

Could you post the portion of your code that produced the error, along with the traceback? Or have you solved it already?

@Iambusayor (Collaborator, Author) commented:

> There's a sketch of the script in the original paper. There's also the crude implementation linked above. How can I adapt these into what we need?

Understand what the paper explains, then see whether the crude implementation suffices or whether you have to modify it or write your own.

@owos (Owner) commented Jan 18, 2024

Hi @lumnolar, so I took a look at this and here's how you could work with the switchout script.

from datasets import load_dataset
from transformers import AutoTokenizer
from switchout import hamming_distance_sample

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
tau = 0.2  # temperature controlling how aggressively tokens get replaced
bos_id = tokenizer.convert_tokens_to_ids(tokenizer.bos_token)
eos_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
pad_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
vocab_size = tokenizer.vocab_size
padding = "max_length"

# Quick sanity check on a single sentence.
model_inputs1 = tokenizer(
    "I am a boy", padding=padding, truncation=True, return_tensors="pt"
)

# Corrupt the token ids, then decode to see what switchout produced.
model_inputs2 = hamming_distance_sample(
    model_inputs1["input_ids"], tau, bos_id, eos_id, pad_id, vocab_size
)
model_inputs2_to_text = tokenizer.batch_decode(model_inputs2, skip_special_tokens=True)
print(model_inputs2_to_text)



# Working with our dataset:
data = load_dataset("masakhane/mafand", "en-yor")


def apply_switchout(examples):
    # Tokenize both sides of each pair, then corrupt each row with switchout
    # (one row at a time, since hamming_distance_sample expects a 2-D tensor).
    source_lang = "en"  # this should not be hard-coded
    target_lang = "yor"  # this should not be hard-coded

    inputs = [ex[source_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, padding=padding, truncation=True, return_tensors="pt"
    )
    model_inputs["input_ids"] = [
        hamming_distance_sample(
            inp.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for inp in model_inputs["input_ids"]
    ]
    targets = [ex[target_lang] for ex in examples["translation"]]
    labels = tokenizer(targets, padding=padding, truncation=True, return_tensors="pt")
    labels["input_ids"] = [
        hamming_distance_sample(
            trgt.reshape(1, -1), tau, bos_id, eos_id, pad_id, vocab_size
        ).squeeze()
        for trgt in labels["input_ids"]
    ]
    # What you do from here depends on the model you are working with; this
    # example uses mbart-50. If you pad to max_length and want the loss to
    # ignore padding, replace pad token ids in the labels with -100:
    ignore_pad_token_for_loss = False  # set True to mask padding in the loss
    if padding == "max_length" and ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(int(l) if int(l) != tokenizer.pad_token_id else -100) for l in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_data = data.map(apply_switchout, batched=True)
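To see the effect, you can decode one augmented example next to its original (this assumes the mafand schema used above):

print(data["train"][0]["translation"]["en"])
print(tokenizer.decode(tokenized_data["train"][0]["input_ids"], skip_special_tokens=True))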

Please play with the code to make sure you understand it.

Now, the next step would be to implement two types of switchout:

  1. In-language switchout: do replacement using tokens drawn from the language itself
  2. Random switchout: do replacement using random tokens from the full vocabulary

How to implement 1 (a rough sketch follows after these steps):

  1. Change the last parameter of switchout, i.e. vocab_size, to be a list rather than an int.
  2. Wherever vocab_size is used, make the implementation sample from that list and return the token id it picks.
  3. To build the list that gets passed into the hamming_distance_sample function, take the train split, tokenize it, convert the result to a list, and remove duplicates.
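A minimal sketch of those three steps, assuming the hamming_distance_sample variant sketched earlier in the thread (the helper names build_in_language_ids and in_language_switchout are illustrative, not from the repo):

import torch

def build_in_language_ids(dataset, tokenizer, lang):
    # Step 3: tokenize the train split, collect the unique token ids, and drop
    # special tokens so they are never sampled as replacements.
    ids = set()
    for ex in dataset["train"]:
        ids.update(tokenizer(ex["translation"][lang])["input_ids"])
    ids -= set(tokenizer.all_special_ids)
    return sorted(ids)

def in_language_switchout(sents, tau, bos_id, eos_id, pad_id, vocab_ids):
    # Steps 1-2: the position sampling is unchanged; only the replacement
    # values now come from vocab_ids (a list) instead of range(vocab_size).
    special = (sents == bos_id) | (sents == eos_id) | (sents == pad_id)
    lengths = (~special).float().sum(dim=1)
    batch_size, n_steps = sents.size()

    logits = -torch.arange(n_steps, dtype=torch.float)
    logits = logits.unsqueeze(0).expand(batch_size, n_steps).contiguous()
    logits = logits.masked_fill(special, float("-inf"))
    probs = torch.softmax(logits * tau, dim=1)
    num_words = torch.distributions.Categorical(probs).sample().float()

    rate = (num_words / lengths).unsqueeze(1).expand(batch_size, n_steps)
    rate = rate.contiguous().masked_fill(special, 0.0).clamp_(0.0, 1.0)
    corrupt_pos = torch.bernoulli(rate).bool()

    # Sample replacement ids directly from the in-language pool.
    pool = torch.tensor(vocab_ids, dtype=torch.long)
    picks = pool[torch.randint(0, len(vocab_ids), (int(corrupt_pos.sum()),))]
    out = sents.clone()
    out[corrupt_pos] = picks
    return out

For example, vocab_ids = build_in_language_ids(data, tokenizer, "yor") would then be passed where vocab_size used to go.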

Please respond with questions for any section that you do not understand.

We have also decided that you will be the one to carry out the ablation studies needed to find the best tau value, so please start early.
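If it helps, a simple starting point for that ablation is a sweep over candidate tau values, re-augmenting and re-scoring for each one; the grid below is illustrative, and the training/evaluation steps are placeholders:

results = {}
for t in (0.1, 0.2, 0.3, 0.5, 0.8, 1.0):  # illustrative grid, not from the paper
    tau = t  # apply_switchout reads the module-level tau defined earlier
    tokenized = data.map(apply_switchout, batched=True, load_from_cache_file=False)
    # ... train a model on `tokenized` and evaluate BLEU on the dev split ...
    # results[t] = dev_bleu
print(results)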

@r-chinonyelum commented:

Alright. Thank you very much. I'm on it.

@Onoyiza commented Jan 20, 2024

Hi @lumnolar, can I work with you on this? Do you still need someone on this task?
