[misc] fix new tokens adding #7253

flashJd · 2025-03-11T10:54:42Z

What does this PR do?

Background:
When running SFT on data distilled from DeepSeek-R1, we want to add two new tokens to the vocab.
However, after setting two config options, new_special_tokens (listing the two tokens) and resize_vocab: true,
and then running SFT, the two tokens still do not appear in vLLM inference output.

Cause:
1) I debugged the code and found that the new_special_tokens config calls tokenizer.add_special_tokens, which adds the new tokens to additional_special_tokens.
2) vLLM skips special tokens by default when decoding, so the new tokens never appear in vLLM inference output.

Fix:
In this case the new tokens should simply be added to the vocab rather than made special tokens (special tokens such as <|im_end|> always serve a special function). This PR therefore adds a config option, new_normal_tokens, which calls tokenizer.add_tokens; after that, the new tokens are displayed in the inference output.
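
The difference between the two registration paths can be illustrated with a small toy model. This is only a sketch: ToyTokenizer, its methods, and the tag names <special_tag> and <normal_tag> are hypothetical stand-ins written for this example, not the transformers or vLLM implementation. It assumes only what is described above: tokens registered as special are dropped at decode time when special tokens are skipped, while ordinary added tokens are kept.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer (not the transformers API)."""

    def __init__(self, vocab):
        self.id_to_token = list(vocab)  # position in the list is the token id
        self.special_ids = set()        # ids that decode() may filter out

    def add_tokens(self, tokens):
        """Append ordinary tokens to the vocab and return their ids."""
        ids = []
        for token in tokens:
            if token not in self.id_to_token:
                self.id_to_token.append(token)
            ids.append(self.id_to_token.index(token))
        return ids

    def add_special_tokens(self, tokens):
        """Append tokens to the vocab and also mark them as special."""
        ids = self.add_tokens(tokens)
        self.special_ids.update(ids)
        return ids

    def decode(self, ids, skip_special_tokens=True):
        """Join tokens with spaces, optionally dropping special ones."""
        kept = [
            self.id_to_token[i]
            for i in ids
            if not (skip_special_tokens and i in self.special_ids)
        ]
        return " ".join(kept)


tok = ToyTokenizer(["hello", "world"])            # hello -> 0, world -> 1
(special_id,) = tok.add_special_tokens(["<special_tag>"])
(normal_id,) = tok.add_tokens(["<normal_tag>"])

ids = [special_id, 0, normal_id, 1]
skipped = tok.decode(ids)                          # special tokens skipped
full = tok.decode(ids, skip_special_tokens=False)  # nothing filtered

print(skipped)  # hello <normal_tag> world
print(full)     # <special_tag> hello <normal_tag> world
```

In the toy model, only the token registered through add_special_tokens disappears from the default decode, which mirrors the behavior reported in the Cause section.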

Before submitting

Did you read the contributor guideline?
Did you write any new necessary tests?

[misc] fix new tokens adding (hiyouga#7119)(hiyouga#7080)

ed581e0

hiyouga added the pending label (This problem is yet to be addressed) on Mar 11, 2025
