
[Falcon] Use stated vocab size #2914

Merged
ggerganov merged 1 commit into ggml-org:master on Sep 14, 2023

Conversation

akawrykow
Contributor

@akawrykow commented Aug 30, 2023

See e.g. #2894 and #2868.

In all of these cases, config.json has a vocab_size with the correct value, but tokenizer.json contains a different number of tokens than that vocab size. Later on, inference expects a tensor with vocab_size as one of its dimensions but instead gets a tensor sized to the actual token count.

At least in the case of #2894, there is some configuration for an extra 'pad' token which makes up the difference (we are only missing a single token). However, for #2868 the difference is much larger, and I wasn't able to figure out where those tokens were supposed to come from.

In both cases, this fix produces a GGUF that doesn't run into that mismatch later on. That's because we already have logic to introduce pad tokens when an ID is not found in the tokenizer: https://github.com/ggerganov/llama.cpp/blob/71d6975559acfd6c8407a4ef8275a9979c737765/convert-falcon-hf-to-gguf.py#L155-L157
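For context, that padding logic amounts to roughly the following (a minimal sketch, not the exact convert-falcon-hf-to-gguf.py code; file names and JSON keys follow the usual Hugging Face layout and are assumptions here):

```python
import json

# vocab_size as stated by the model config; this can be larger than the
# number of entries tokenizer.json actually provides.
with open("config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

# token string -> id mapping from the fast-tokenizer file
with open("tokenizer.json", encoding="utf-8") as f:
    vocab = json.load(f)["model"]["vocab"]

reverse_vocab = {token_id: token for token, token_id in vocab.items()}

tokens = []
for i in range(vocab_size):
    if i in reverse_vocab:
        tokens.append(reverse_vocab[i].encode("utf-8"))
    else:
        # IDs missing from the tokenizer are filled with placeholder pad
        # tokens so the written token list matches the stated vocab_size.
        tokens.append(f"[PAD{i}]".encode("utf-8"))
```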

@ggerganov
Member

I'm looking at https://huggingface.co/tiiuae/falcon-rw-7b and it's very confusing:

It seems that for this model the change in this PR would make it think the vocab size is 50304, while it actually appears to be 65024.

@akawrykow
Contributor Author

Yeah, it is confusing, but I'm pretty sure inference will fail either way (model architecture differences aside). See e.g. #2887 (comment)

But maybe there is some yet-undiscovered way to make these two numbers agree.

@akawrykow
Contributor Author

I just noticed that vocab.json has 50304: https://huggingface.co/tiiuae/falcon-rw-1b/raw/e4b9872bb803165eb22f0a867d4e6a64d34fce19/vocab.json

So in this case, using the stated vocab size would be fine, but we would also have to load the tokens from this file instead of tokenizer.json. Why is it like this???
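A quick way to compare the three counts locally (a sketch; it assumes the model repo's config.json, vocab.json, and tokenizer.json have been downloaded into the working directory):

```python
import json

with open("config.json") as f:
    print("config.json vocab_size: ", json.load(f)["vocab_size"])

with open("vocab.json", encoding="utf-8") as f:
    print("vocab.json entries:     ", len(json.load(f)))

with open("tokenizer.json", encoding="utf-8") as f:
    print("tokenizer.json entries: ", len(json.load(f)["model"]["vocab"]))
```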

@Mihaiii
Contributor

Mihaiii commented Sep 12, 2023

Any updates on this one?

@ggerganov merged commit 5c872db into ggml-org:master on Sep 14, 2023