[Falcon] Use stated vocab size #2914

akawrykow · 2023-08-30T19:11:58Z

In all of these cases, config.json states a vocab_size with the correct size, but tokenizer.json contains a different number of tokens. Later on, inference expects a tensor with vocab_size as one of its dimensions but gets <actual count of tokens> instead.

At least in the case of #2894, there is some configuration for an extra 'pad' token which makes up the difference (we are only missing a single token). However, for #2868 the difference is much larger, and I wasn't able to figure out where those tokens were supposed to come from.

In both cases, this fix produces a gguf that doesn't run into the mismatch later on. That's because we already have some logic to introduce pad tokens if an ID is not found: https://github.com/ggerganov/llama.cpp/blob/71d6975559acfd6c8407a4ef8275a9979c737765/convert-falcon-hf-to-gguf.py#L155-L157
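As a rough illustration of the padding idea (not the actual code from the conversion script), here is a minimal sketch. It assumes the tokenizer vocabulary is a plain token-to-ID mapping; the function name `build_token_list` and the `[PAD{i}]` placeholder format are hypothetical choices made for this example.

```python
def build_token_list(vocab: dict[str, int], vocab_size: int) -> list[bytes]:
    """Return exactly vocab_size tokens indexed by ID, padding any missing ID."""
    # Invert the token -> ID mapping so tokens can be looked up by ID.
    reverse_vocab = {token_id: token for token, token_id in vocab.items()}

    tokens = []
    for i in range(vocab_size):
        if i in reverse_vocab:
            tokens.append(reverse_vocab[i].encode("utf-8"))
        else:
            # ID not present in the tokenizer: emit a placeholder token
            # (hypothetical naming) so the list length matches vocab_size.
            tokens.append(f"[PAD{i}]".encode("utf-8"))
    return tokens


# Toy vocabulary with a gap at ID 2 and a stated size of 5.
tokens = build_token_list({"a": 0, "b": 1, "c": 3}, vocab_size=5)
print(len(tokens), tokens)
```

Note that in this sketch any token whose ID is greater than or equal to the stated vocab_size is silently dropped, so it only helps when the stated size is at least as large as the real vocabulary.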

ggerganov · 2023-08-30T19:34:09Z

I'm looking at https://huggingface.co/tiiuae/falcon-rw-7b and it's very confusing:

The readme states 65024 tokens.
vocab.config states 50304 tokens.
tokenizer.json contains 65024 tokens.

It seems that for this model, the change in this PR would make the converter assume the vocab is 50304, while it actually appears to be 65024.
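One way to spot this kind of disagreement is to compare the stated size with the number of distinct token IDs. The sketch below assumes the usual Hugging Face tokenizer.json layout (a `model.vocab` mapping plus an optional `added_tokens` list) and uses toy dictionaries rather than the real Falcon files; the function names are hypothetical.

```python
def count_tokenizer_tokens(tokenizer: dict) -> int:
    """Count distinct token IDs in a parsed tokenizer.json-style dict."""
    ids = set(tokenizer["model"]["vocab"].values())
    # Added tokens may or may not already be in the base vocab; a set
    # avoids counting the same ID twice.
    ids.update(entry["id"] for entry in tokenizer.get("added_tokens", []))
    return len(ids)


def vocab_mismatch(config: dict, tokenizer: dict) -> int:
    """Stated vocab_size minus the token count; 0 means the files agree."""
    return config["vocab_size"] - count_tokenizer_tokens(tokenizer)


# Toy data: 4 base tokens, 1 added token, stated size of 6.
toy_config = {"vocab_size": 6}
toy_tokenizer = {
    "model": {"vocab": {"a": 0, "b": 1, "c": 2, "d": 3}},
    "added_tokens": [{"id": 4, "content": "<pad>"}],
}

mismatch = vocab_mismatch(toy_config, toy_tokenizer)
print(f"stated size differs from tokenizer count by {mismatch}")
```

A positive result means the tokenizer has fewer tokens than stated; a negative result means the stated size would cut off real tokens.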

akawrykow · 2023-08-30T20:10:38Z

Yeah, it is confusing, but I'm pretty sure the inference will fail (model architecture differences aside). See e.g. #2887 (comment)

But maybe there is some yet-undiscovered way to make these two numbers agree.

akawrykow · 2023-08-30T20:13:49Z

I just noticed that vocab.json has 50304: https://huggingface.co/tiiuae/falcon-rw-1b/raw/e4b9872bb803165eb22f0a867d4e6a64d34fce19/vocab.json

So in this case, using the stated vocab size would be fine, but we would also have to load the tokens from this file instead of tokenizer.json. Why is it like this???
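A minimal sketch of that idea, assuming vocab.json is a flat token-to-ID JSON object and that the converter could pick whichever token source matches the stated size. The helper names `load_vocab_json` and `choose_vocab_source` are hypothetical, and the data here is a toy stand-in, not the real Falcon files.

```python
import json
from typing import Optional


def load_vocab_json(text: str) -> dict[str, int]:
    """Parse vocab.json-style content: a flat {token: id} JSON object."""
    return {str(token): int(token_id) for token, token_id in json.loads(text).items()}


def choose_vocab_source(vocab_size: int, sources: dict[str, dict[str, int]]) -> Optional[str]:
    """Return the name of the first source whose token count equals vocab_size."""
    for name, vocab in sources.items():
        if len(vocab) == vocab_size:
            return name
    return None  # No source agrees with the stated size.


# Toy stand-ins: the "tokenizer.json" vocab has 4 tokens, "vocab.json" has 3.
sources = {
    "tokenizer.json": {"a": 0, "b": 1, "c": 2, "d": 3},
    "vocab.json": load_vocab_json('{"a": 0, "b": 1, "c": 2}'),
}

chosen = choose_vocab_source(3, sources)
print(chosen)
```

Returning `None` when nothing matches leaves the caller to decide whether to fall back to padding or to fail loudly.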

Mihaiii · 2023-09-12T11:59:58Z

Any updates on this one?

Use stated vocab size

3a7e9eb

akawrykow force-pushed the falcon-vocab-fixes branch from ebb4746 to 3a7e9eb on August 30, 2023 19:17

ggerganov approved these changes Aug 30, 2023

akawrykow mentioned this pull request Aug 31, 2023

[User] “'token_embd.weight' has wrong shape” when loading WizardLM-Uncensored-Falcon-40b #2894

Closed

klosax approved these changes Sep 3, 2023

ggerganov merged commit 5c872db into ggml-org:master Sep 14, 2023

KerfuffleV2 mentioned this pull request Sep 21, 2023

Try to fix Baichuan2 models by using vocab size in config.json #3299

Merged

pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023

falcon : use stated vocab size (ggml-org#2914)

b451b1c

Govind-S-B mentioned this pull request Dec 22, 2023

Support for conversion of falcon-rw-1b to gguf format #4580

Closed
