Unable to convert Mistral-7B-OpenOrca to GGUF #3583

Closed
Nate687 opened this issue Oct 11, 2023 · 4 comments

Nate687 commented Oct 11, 2023

Hello,

I am attempting to convert Mistral-7B-OpenOrca to GGUF using convert.py. I understand that TheBloke has released a GGUF version, but I want to convert it myself on my local machine.

However, I keep getting the error:

Exception: Expected added token IDs to be sequential and start at 6; got [0, 1, 2, 32000, 32001, 32002]

I believe this is due to the additional "added_tokens.json" file that Mistral-7B-OpenOrca has.
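
For reference, this is roughly the comparison in convert.py that fails (a sketch only; vocab_size is 32000 for Mistral-7B's base SentencePiece vocab, and the IDs below are taken from the error message above):

vocab_size = 32000                           # Mistral-7B base SentencePiece vocab size
actual_ids = [0, 1, 2, 32000, 32001, 32002]  # IDs listed in added_tokens.json
expected_ids = list(range(vocab_size, vocab_size + len(actual_ids)))
# expected_ids == [32000, ..., 32005]; IDs 0, 1 and 2 already exist in the
# base vocab, so the two lists differ and convert.py raises the exception.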

A similar issue was reported here, but no fix was provided.

Any solutions or pointers would be greatly appreciated.

Thanks,
Nate


seungduk-yanolja commented Oct 11, 2023

        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + 2

Find this in SentencePieceVocab (convert.py) and just use + 2. For reference, here is my modified class:

class SentencePieceVocab:
    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
        self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
        added_tokens: dict[str, int]
        if fname_added_tokens is not None:
            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
        else:
            added_tokens = {}

        vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
        actual_ids   = sorted(added_tokens.values())
#        if expected_ids != actual_ids:
#            raise Exception(f"Expected added token IDs {expected_ids} to be sequential and start at {len(added_tokens)}; got {actual_ids}")

        # Split the added tokens: IDs inside the base vocab override existing
        # pieces, while IDs past the end of the base vocab are genuinely new.
        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
        added = []
        override = {}
        for text, idx in items:
            if idx < vocab_size:
                override[idx] = text
            else:
                added.append(text)
        self.override_tokens: dict[int, str] = override
        self.added_tokens_list = added
        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + 3
        self.fname_tokenizer = fname_tokenizer
        self.fname_added_tokens = fname_added_tokens

    def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
        tokenizer = self.sentencepiece_tokenizer
        for i in range(tokenizer.vocab_size()):
            piece = tokenizer.id_to_piece(i) if i not in self.override_tokens else self.override_tokens[i]
            text: bytes = piece.encode("utf-8")
            score: float = tokenizer.get_score(i)
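            # ... rest of the method omitted in this excerpt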

Mine is + 3 because I added a new token, but yours should be + 2. Did you fine-tune the model and add new tokens?
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json
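
A less brittle variant (an untested sketch) would derive the offset from the tokens that actually extend the vocab, instead of hardcoding + 2 or + 3:

        # added_tokens_list is built in the loop above and holds only the
        # tokens whose IDs fall past the end of the base vocab.
        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_list)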

@seungduk-yanolja

I wrote a PR: #3585

staviq commented Oct 11, 2023

For a quick workaround, for this model specifically: in added_tokens.json, delete the lines for tokens 0, 1, and 2. Those particular tokens are basically always assumed to be in the vocab already, and since you are not removing them, their definitions aren't needed in this specific case.
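
If you'd rather script that edit, a minimal sketch (the path and the base vocab size are assumptions for this model) could look like:

import json
from pathlib import Path

# Path to the model's added_tokens.json (adjust as needed).
path = Path("added_tokens.json")
added = json.loads(path.read_text(encoding="utf-8"))

# Drop entries whose IDs are already inside the base vocab (0, 1, 2 here);
# keep only the tokens that genuinely extend it (IDs >= 32000 for Mistral-7B).
base_vocab_size = 32000
kept = {text: idx for text, idx in added.items() if idx >= base_vocab_size}
path.write_text(json.dumps(kept, indent=2), encoding="utf-8")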

github-actions bot added the stale label Mar 19, 2024

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
