Unable to convert Mistral-7B-OpenOrca to GGUF #3583

Closed
Nate687 opened this issue Oct 11, 2023 · 4 comments

Nate687 commented Oct 11, 2023

Hello,

I am attempting to convert Mistral-7B-OpenOrca to GGUF using convert.py. I understand that TheBloke has released a GGUF version, but I want to convert it myself on my local machine.

However, I keep getting the error:

Exception: Expected added token IDs to be sequential and start at 6; got [0, 1, 2, 32000, 32001, 32002]

I believe this is due to the additional "added_tokens.json" file that Mistral-7B-OpenOrca has.
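
For reference, this is roughly the comparison in convert.py that fails (a sketch only; vocab_size is 32000 for Mistral-7B's base SentencePiece vocab, and the IDs below are taken from the error message above):

vocab_size = 32000                           # Mistral-7B base SentencePiece vocab size
actual_ids = [0, 1, 2, 32000, 32001, 32002]  # IDs listed in added_tokens.json
expected_ids = list(range(vocab_size, vocab_size + len(actual_ids)))
# expected_ids == [32000, ..., 32005]; IDs 0, 1 and 2 already exist in the
# base vocab, so the two lists differ and convert.py raises the exception.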

A similar issue was reported here, but no fix was provided.

Any solutions or pointers would be greatly appreciated.

Thanks,
Nate


seungduk-yanolja commented Oct 11, 2023

        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + 2

Find this in SentencePieceVocab (convert.py) and just use + 2. For reference, here is my modified class:

class SentencePieceVocab:
    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
        self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
        added_tokens: dict[str, int]
        if fname_added_tokens is not None:
            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
        else:
            added_tokens = {}

        vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
        actual_ids   = sorted(added_tokens.values())
#        if expected_ids != actual_ids:
#            raise Exception(f"Expected added token IDs {expected_ids} to be sequential and start at {len(added_tokens)}; got {actual_ids}")

        # Split the added tokens: IDs inside the base vocab override existing
        # pieces, while IDs past the end of the base vocab are genuinely new.
        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
        added = []
        override = {}
        for text, idx in items:
            if idx < vocab_size:
                override[idx] = text
            else:
                added.append(text)
        self.override_tokens: dict[int, str] = override
        self.added_tokens_list = added
        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + 3
        self.fname_tokenizer = fname_tokenizer
        self.fname_added_tokens = fname_added_tokens

    def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
        tokenizer = self.sentencepiece_tokenizer
        for i in range(tokenizer.vocab_size()):
            piece = tokenizer.id_to_piece(i) if i not in self.override_tokens else self.override_tokens[i]
            text: bytes = piece.encode("utf-8")
            score: float = tokenizer.get_score(i)
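            # ... rest of the method omitted in this excerpt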

Mine is + 3 because I added a new token, but yours should be + 2. Did you fine-tune the model and add new tokens?
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json
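
A less brittle variant (an untested sketch) would derive the offset from the tokens that actually extend the vocab, instead of hardcoding + 2 or + 3:

        # added_tokens_list is built in the loop above and holds only the
        # tokens whose IDs fall past the end of the base vocab.
        self.vocab_size_base: int = vocab_size
        self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_list)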

@seungduk-yanolja

I wrote a PR: #3585

staviq commented Oct 11, 2023

For a quick workaround, for this model specifically: in added_tokens.json, delete the lines for tokens 0, 1, and 2. Those particular tokens are basically always assumed to be in the vocab already, and since you are not removing them, their definitions aren't needed in this specific case.
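
If you'd rather script that edit, a minimal sketch (the path and the base vocab size are assumptions for this model) could look like:

import json
from pathlib import Path

# Path to the model's added_tokens.json (adjust as needed).
path = Path("added_tokens.json")
added = json.loads(path.read_text(encoding="utf-8"))

# Drop entries whose IDs are already inside the base vocab (0, 1, 2 here);
# keep only the tokens that genuinely extend it (IDs >= 32000 for Mistral-7B).
base_vocab_size = 32000
kept = {text: idx for text, idx in added.items() if idx >= base_vocab_size}
path.write_text(json.dumps(kept, indent=2), encoding="utf-8")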

github-actions bot added the stale label Mar 19, 2024

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
