
Is there a problem with the dataset code? The input and output are aligned, with no teacher forcing. #151

Open
albert-jeffery opened this issue Dec 18, 2024 · 0 comments


Teacher forcing means the input and output are offset by one position, so that the model learns to predict the next token. But the author's dataset code here does not offset the input and output:

# ChatGLM3 requires adding the [gMASK] and sop tokens
input_ids = [tokenizer.get_command("[gMASK]"),
             tokenizer.get_command("sop")] + src_tokens + tgt_tokens + [tokenizer.eos_token_id]
context_length = len(src_tokens) + 2
labels = [-100] * context_length + input_ids[context_length:]

For example, with input [1, 2, 3] and output [4, 5, 6], the code above produces the following, where 64790 is [gMASK], 64792 is sop, and 2 is eos:

input_ids: [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
labels:    [-100, -100, -100, -100, -100, 4, 5, 6, 2]
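
To make this concrete, here is a minimal, self-contained sketch of the same construction. The constants GMASK_ID, SOP_ID and EOS_ID are stand-ins for the tokenizer.get_command / eos_token_id lookups, using the ID values mentioned above:

# Reproduces the dataset-code construction with dummy token IDs.
GMASK_ID, SOP_ID, EOS_ID = 64790, 64792, 2   # values observed for ChatGLM3

src_tokens = [1, 2, 3]   # the "input" side
tgt_tokens = [4, 5, 6]   # the "output" side

input_ids = [GMASK_ID, SOP_ID] + src_tokens + tgt_tokens + [EOS_ID]
context_length = len(src_tokens) + 2
labels = [-100] * context_length + input_ids[context_length:]

print(input_ids)  # [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
print(labels)     # [-100, -100, -100, -100, -100, 4, 5, 6, 2]
# input_ids and labels have the same length, and position i of labels
# lines up with position i of input_ids -- there is no one-position offset here.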

Why is this? Is the code written incorrectly?
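
For comparison, this is a sketch of the explicit one-position offset I would have expected if the shift were done in the dataset code itself (my own illustration reusing the lists above, not code from the repository):

input_ids = [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
labels    = [-100, -100, -100, -100, -100, 4, 5, 6, 2]

# Shift by one position: the model reads tokens 0..n-2 and is asked to
# predict tokens 1..n-1 (prompt positions stay masked with -100).
shifted_inputs = input_ids[:-1]
shifted_labels = labels[1:]

print(shifted_inputs)  # [64790, 64792, 1, 2, 3, 4, 5, 6]
print(shifted_labels)  # [-100, -100, -100, -100, 4, 5, 6, 2]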
