Resolve double BOS token issue #462

eldarkurtic · 2025-03-03T14:27:53Z

When a chat template is used, the template itself already adds the BOS token to the context. This PR prevents a second BOS token from being added. At the moment the duplicate is added automatically when the context passes through the tokenizer's call, which by default runs with add_special_tokens=True (ref: https://github.com/huggingface/lighteval/blob/ed084813e0bd12d82a06d9f913291fdbee774905/src/lighteval/models/vllm/vllm_model.py#L90).

Before this PR, inputs to the model would look like this:

[[151646, 151646, 151644, 50, 3948, 279, 2701 ...
<｜begin▁of▁sentence｜><｜begin▁of▁sentence｜><｜User｜>Solve the following ...

After this PR, inputs to the model would look like this:

[[151646, 151644, 50, 3948, 279, 2701 ...
<｜begin▁of▁sentence｜><｜User｜>Solve the following ...
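To make the before/after difference easy to reproduce without loading a model, here is a minimal sketch that uses a toy tokenizer. `toy_encode` and `VOCAB` are hypothetical names introduced only for illustration; they are not part of lighteval or transformers. The token ids are copied from the example above, but the text pieces assigned to ids 50, 3948, 279 and 2701 are assumptions. The only behaviour being mimicked is that `add_special_tokens=True` prepends a BOS token regardless of what the text already contains.

```python
BOS = "<｜begin▁of▁sentence｜>"

# Tiny vocabulary built from the ids shown in this PR. Which text piece maps
# to 50, 3948, 279 and 2701 is an assumption made for this illustration.
VOCAB = {
    BOS: 151646,
    "<｜User｜>": 151644,
    "S": 50,
    "olve": 3948,
    " the": 279,
    " following": 2701,
}


def toy_encode(text, add_special_tokens=True):
    """Hypothetical stand-in for a tokenizer call; only mimics BOS handling."""
    # Like a real tokenizer configured to add BOS, prepend it unconditionally
    # when add_special_tokens is True, even if the text already starts with it.
    ids = [VOCAB[BOS]] if add_special_tokens else []
    pos = 0
    while pos < len(text):
        # Greedy longest match against the toy vocabulary.
        piece = max(
            (p for p in VOCAB if text.startswith(p, pos)), key=len, default=None
        )
        if piece is None:
            raise ValueError(f"no toy vocabulary entry at position {pos}")
        ids.append(VOCAB[piece])
        pos += len(piece)
    return ids


# Context as it would look after a chat template has already inserted BOS.
templated = BOS + "<｜User｜>" + "Solve the following"

before = toy_encode(templated, add_special_tokens=True)   # behaviour before this PR
after = toy_encode(templated, add_special_tokens=False)   # behaviour after this PR

print(before)
print(after)
```

In the sketch, `before` starts with two BOS ids and `after` starts with one, matching the two input dumps above.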

Important Note: Right now, lighteval has no support for parsing this argument from a string input into a boolean, so this PR will only take effect once huggingface/lighteval#598 is merged.

This stops the tokenizer's call from prepending a BOS token when tokenizing a context that already has a BOS token added by its own chat template.

Resolve double BOS token issue

6b5f6c7

