[misc] fix new tokens adding #7253

Open: flashJd wants to merge 1 commit into main

@flashJd flashJd commented Mar 11, 2025

What does this PR do?

Fixes #7119
Fixes #7080

Background:
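When running SFT on data distilled from DeepSeek-R1, we want to add two new tokens, `<think>` and `</think>`, to the vocab.
But after adding two configs, `new_special_tokens` (set to `<think>` and `</think>`) and `resize_vocab: true`,
and then running SFT, we still can't get `<think>` and `</think>` in vLLM inference output.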

Cause:
1) I debugged the code and found that the `new_special_tokens` config calls `tokenizer.add_special_tokens`, which adds the new tokens to `additional_special_tokens`.
2) vLLM skips special tokens by default, so `<think>` and `</think>` do not appear in vLLM inference output.

Fix:
In this case we should just add the new tokens to the vocab rather than make them special tokens (special tokens like `<|im_end|>` always have a special function). So this PR adds a config, `new_normal_tokens`, which calls `tokenizer.add_tokens`; after that, the new tokens are displayed in the inference output.
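
To illustrate the difference between the two registration paths, here is a minimal toy sketch. `ToyTokenizer` and `demo` are hypothetical names invented for this example; they are not the transformers or vLLM implementation, only a stand-in for the "skip special tokens on decode" behavior described above. The tokens `<think>` and `</think>` are used as the example, following the linked issue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTokenizer:
    """Toy stand-in for a tokenizer (assumption: NOT the real library code)."""

    vocab: dict = field(default_factory=dict)      # token string -> id
    special_ids: set = field(default_factory=set)  # ids flagged as special

    def add_tokens(self, tokens, special=False):
        """Add tokens to the vocab; optionally flag them as special."""
        added = 0
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
            if special:
                self.special_ids.add(self.vocab[tok])
        return added

    def decode(self, ids, skip_special_tokens=True):
        """Join tokens back into text, dropping special ones if requested."""
        id_to_token = {i: t for t, i in self.vocab.items()}
        kept = [
            id_to_token[i]
            for i in ids
            if not (skip_special_tokens and i in self.special_ids)
        ]
        return "".join(kept)


def demo(register_as_special):
    """Decode the same sequence with the new tokens special or normal."""
    tok = ToyTokenizer()
    tok.add_tokens(["plan", "answer"])
    tok.add_tokens(["<think>", "</think>"], special=register_as_special)
    ids = [tok.vocab[t] for t in ["<think>", "plan", "</think>", "answer"]]
    return tok.decode(ids, skip_special_tokens=True)


print(demo(True))   # tokens registered as special are dropped on decode
print(demo(False))  # tokens registered as normal survive decode
```

In this toy model the ids are identical in both runs; only the "special" flag changes whether the tags reach the decoded text, which mirrors why adding the tokens as normal vocab entries makes them visible.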


@hiyouga added the `pending` label ("This problem is yet to be addressed") on Mar 11, 2025
Successfully merging this pull request may close these issues.

After training with `<think>` tags, the model's predictions contain no `<think>` tags
Special tokens fail to be output