
Is there a problem with the dataset code? The input and output are aligned, with no teacher forcing. #151

Open
albert-jeffery opened this issue Dec 18, 2024 · 0 comments


Teacher forcing means the input and output are offset by one position, so that the model learns to predict the next token. But the author's dataset code here does not offset the input and output:

# ChatGLM3 requires adding the [gMASK] and sop tokens
input_ids = [tokenizer.get_command("[gMASK]"),
             tokenizer.get_command("sop")] + src_tokens + tgt_tokens + [tokenizer.eos_token_id]
context_length = len(src_tokens) + 2
labels = [-100] * context_length + input_ids[context_length:]

For example, with input [1, 2, 3] and output [4, 5, 6], the code above produces the following, where 64790 is [gMASK], 64792 is sop, and 2 is eos:

input_ids: [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
labels:    [-100, -100, -100, -100, -100, 4, 5, 6, 2]
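
To make this concrete, here is a minimal, self-contained sketch of the same construction. The constants GMASK_ID, SOP_ID and EOS_ID are stand-ins for the tokenizer.get_command / eos_token_id lookups, using the ID values mentioned above:

# Reproduces the dataset-code construction with dummy token IDs.
GMASK_ID, SOP_ID, EOS_ID = 64790, 64792, 2   # values observed for ChatGLM3

src_tokens = [1, 2, 3]   # the "input" side
tgt_tokens = [4, 5, 6]   # the "output" side

input_ids = [GMASK_ID, SOP_ID] + src_tokens + tgt_tokens + [EOS_ID]
context_length = len(src_tokens) + 2
labels = [-100] * context_length + input_ids[context_length:]

print(input_ids)  # [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
print(labels)     # [-100, -100, -100, -100, -100, 4, 5, 6, 2]
# input_ids and labels have the same length, and position i of labels
# lines up with position i of input_ids -- there is no one-position offset here.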

Why is this? Is the code written incorrectly?
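
For comparison, this is a sketch of the explicit one-position offset I would have expected if the shift were done in the dataset code itself (my own illustration reusing the lists above, not code from the repository):

input_ids = [64790, 64792, 1, 2, 3, 4, 5, 6, 2]
labels    = [-100, -100, -100, -100, -100, 4, 5, 6, 2]

# Shift by one position: the model reads tokens 0..n-2 and is asked to
# predict tokens 1..n-1 (prompt positions stay masked with -100).
shifted_inputs = input_ids[:-1]
shifted_labels = labels[1:]

print(shifted_inputs)  # [64790, 64792, 1, 2, 3, 4, 5, 6]
print(shifted_labels)  # [-100, -100, -100, -100, 4, 5, 6, 2]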
