
Resolve double BOS token issue #462

Open
eldarkurtic wants to merge 1 commit into main
Conversation

eldarkurtic

When using a chat template, the BOS token is automatically added to the context. This PR disables the addition of a second BOS token, which currently happens when the context passes through the tokenizer's call, since that call defaults to add_special_tokens=True (ref: https://github.com/huggingface/lighteval/blob/ed084813e0bd12d82a06d9f913291fdbee774905/src/lighteval/models/vllm/vllm_model.py#L90).

Before this PR, inputs to the model would look like this:

[[151646, 151646, 151644, 50, 3948, 279, 2701 ...
<|begin▁of▁sentence|><|begin▁of▁sentence|><|User|>Solve the following ...

After this PR, inputs to the model would look like this:

[[151646, 151644, 50, 3948, 279, 2701 ...
<|begin▁of▁sentence|><|User|>Solve the following ...
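The duplication is easy to reproduce directly with a Hugging Face tokenizer. A minimal sketch follows (the checkpoint name is only illustrative; any model whose chat template already prepends BOS behaves the same way):

from transformers import AutoTokenizer

# Illustrative checkpoint; its chat template prepends <|begin▁of▁sentence|> on its own.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

messages = [{"role": "user", "content": "Solve the following ..."}]
context = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Default call (add_special_tokens=True): a second BOS is prepended on top of the
# one already emitted by the chat template.
double_bos = tokenizer(context)["input_ids"]

# With add_special_tokens=False, only the BOS from the chat template remains.
single_bos = tokenizer(context, add_special_tokens=False)["input_ids"]

print(double_bos[:4])  # e.g. [151646, 151646, 151644, ...]
print(single_bos[:4])  # e.g. [151646, 151644, ...]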

Important note: lighteval currently has no support for parsing this argument from its string form into a boolean, so this PR will only take effect once huggingface/lighteval#598 is merged.

This disables the tokenizer's prepending of a BOS token when tokenizing a context that already contains the BOS token inserted by the chat template.
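Conceptually, the fix boils down to threading this choice into the tokenizer call instead of relying on its default; a rough sketch with hypothetical names, not the actual diff:

def encode_context(tokenizer, context: str, use_chat_template: bool):
    # If the chat template already inserted BOS, do not let the tokenizer add a second one.
    return tokenizer(context, add_special_tokens=not use_chat_template)["input_ids"]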