
[Question] Padding influences embedding #43

Closed
ezorita opened this issue Jan 3, 2025 · 3 comments

ezorita commented Jan 3, 2025

When using the tokenizer with padding="max_length" vs. padding="longest", the generated embeddings are completely different. I believe this is a bug, since padding tokens should be masked out, but it might be an intrinsic side effect of how attention is computed.

Passing the same sentence tokenized with the two padding strategies yields a cosine similarity of only 0.35 between the resulting embeddings.

This explains the issue described in #42.

Reproduction script:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

max_seq_length = 32768
testing_string = "Every morning, I make a cup of coffee to start my day."
# Load the M2-BERT 32k retrieval model (uses custom modeling code)
model = AutoModelForSequenceClassification.from_pretrained(
    "togethercomputer/m2-bert-80M-32k-retrieval", trust_remote_code=True
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", model_max_length=max_seq_length
)
# Tokenize with padding to the full 32k model length
input_ids = tokenizer(
    [testing_string],
    return_tensors="pt",
    padding="max_length",
    return_token_type_ids=False,
    truncation=True,
    max_length=max_seq_length,
).to("cuda")

# Tokenize the same string, padding only to the longest sequence in the batch
input_ids_no_padding = tokenizer(
    [testing_string],
    return_tensors="pt",
    padding="longest",
    return_token_type_ids=False,
    truncation=True,
    max_length=max_seq_length,
).to("cuda")

# Embed the max_length-padded input
with torch.no_grad():
    outputs = model(**input_ids)
    embeddings = outputs["sentence_embedding"].to("cpu")

# Embed the longest-padded input
with torch.no_grad():
    outputs = model(**input_ids_no_padding)
    embeddings_no_padding = outputs["sentence_embedding"].to("cpu")


allclose = torch.allclose(embeddings, embeddings_no_padding)
print(f"allclose: {allclose}")

# Print cosine similarity between the two embeddings
cosine_similarity = torch.nn.functional.cosine_similarity(
    embeddings, embeddings_no_padding
)
print(f"cosine similarity: {cosine_similarity}")

Output:

allclose: False
cosine similarity: tensor([0.3504])
DanFu09 (Collaborator) commented Jan 3, 2025 via email

ezorita (Author) commented Jan 3, 2025

Thanks for your quick response @DanFu09. I believe you should explicitly state in the Hugging Face docs that the tokenizer must be used with padding="max_length"; a Python comment on that line would be enough. Otherwise, users may be tempted to use padding="longest" to improve inference efficiency.
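
A minimal sketch of the kind of inline comment suggested above (the exact wording is only illustrative, and the tokenizer call mirrors the reproduction script rather than any official example):

from transformers import AutoTokenizer

max_seq_length = 32768
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", model_max_length=max_seq_length
)
# NOTE: always pad to the model's max length; padding="longest" produces
# different embeddings with this model (see issue #43).
inputs = tokenizer(
    ["Every morning, I make a cup of coffee to start my day."],
    return_tensors="pt",
    padding="max_length",
    return_token_type_ids=False,
    truncation=True,
    max_length=max_seq_length,
)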

ezorita changed the title from [bug] Padding influences embedding to [Question] Padding influences embedding on Jan 3, 2025
DanFu09 (Collaborator) commented Jan 3, 2025 via email
