
Make k-quants work with tensor dimensions that are not multiple of 256 #1919

Closed
ikawrakow opened this issue Jun 18, 2023 · 11 comments · Fixed by #2001
Labels
enhancement (New feature or request), model (Model specific)

Comments

@ikawrakow
Contributor

As discussed in #1602, k-quants do not work for the Falcon-7B model. This is because the number of columns in many tensors (4544) is not divisible by 256, the super-block size of the k-quants (4544 = 17 * 256 + 192).

It would be useful if k-quants could be adapted to work in such cases.

@SlyEcho
Collaborator

SlyEcho commented Jun 18, 2023

Same for OpenLLaMA 3B.

@debackerl
Copy link

Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.

https://github.com/SqueezeAILab/SqueezeLLM

Just my 2 cents :-)

@ikawrakow
Contributor Author

> Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.
>
> https://github.com/SqueezeAILab/SqueezeLLM
>
> Just my 2 cents :-)

My view is that k-quants are SOTA. The discussion in #1595 and the description of #1684 shed some light on why I think so. Perhaps you could elaborate on how you arrived at the conclusion that the SqueezeLLM approach is better?

@KerfuffleV2
Collaborator

This might be a crazy idea, but what if the remainder not divisible by QK_K was just left as f32 or f16 and not quantized at all?

It might need some special logic to handle, but it would only apply to the very last partial "block", and whether that's necessary could be calculated once, outside of any loops.
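
A rough sketch of that idea (hypothetical helper names, not llama.cpp code): compute the split once per row, quantize the full super-blocks, and copy the tail as plain floats.

/* Hypothetical stand-in for a k-quant row quantizer. */
void quantize_k_blocks(const float * src, void * dst, int n);

/* Sketch only: quantize the part of a row covered by full super-blocks and
 * keep the remaining tail unquantized. */
void quantize_row_with_tail(const float * src, void * dst_quant,
                            float * dst_tail, int n, int qk_k) {
    const int n_full = (n / qk_k) * qk_k;   /* elements covered by full super-blocks */
    const int n_tail = n - n_full;          /* remainder, stored as plain f32        */

    quantize_k_blocks(src, dst_quant, n_full);
    for (int i = 0; i < n_tail; ++i) {
        dst_tail[i] = src[n_full + i];
    }
}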

@ikawrakow
Contributor Author

I'm on it. The current thinking is that I will add padding such that I have a multiple of 256. When quantizing, the values that are beyond the actual tensor size will be assumed to be zero. When de-quantizing, one needs to take care to not dequantize values beyond the actual tensor size. Same applies to dot products.

It is a pretty big change, so it will take some time.
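
A minimal sketch of the zero-padding part of that plan (hypothetical helper names, assuming QK_K <= 256; not llama.cpp code):

/* Hypothetical stand-in for a k-quant super-block quantizer. */
void quantize_super_block(const float * src, void * dst);

/* Sketch only: quantize the last, partial super-block of a row by treating
 * the values beyond the actual tensor size as zero. */
void quantize_padded_super_block(const float * src, void * dst, int n_remaining, int qk_k) {
    float tmp[256];   /* assumes qk_k <= 256 */
    for (int i = 0; i < qk_k; ++i) {
        tmp[i] = i < n_remaining ? src[i] : 0.0f;   /* zero-pad past the tensor end */
    }
    quantize_super_block(tmp, dst);
}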

The alternative that was proposed somewhere above is to use super-blocks of 64. This will work for Falcon-7B and for OpenLLaMA 3B. It is a potentially smaller change, but using super-blocks of 64 almost defeats the purpose of the super-blocks, which is to save bits by using quantized scales for the blocks inside a super-block.

To give a specific example: with a super-block of 256 and Q4_K, we have 8 blocks of 32, each having a 6-bit scale and a 6-bit min, so that's 8 * 12 = 96 bits. We then have the fp16 scale and min of the super-block, which is another 32 bits, for a total of 128 bits per super-block, or 0.5 bits of extra data per weight. For a super-block size of 64 we have 2 * 12 + 32 = 56 bits per super-block, or 0.875 bits per weight. That's almost the same as Q4_1 (1 extra bit per weight), so we might as well add to Q4_1 the rmse + cosine-distance minimization used in Q4_K during quantization and just use a modified version of Q4_1.
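
For reference, here is that overhead arithmetic as a tiny self-contained C program (the constants just mirror the figures quoted above; a sketch, not llama.cpp code):

#include <stdio.h>

/* Metadata overhead per weight for a Q4_K-style layout: each inner block of 32
 * weights carries a 6-bit scale and a 6-bit min, and each super-block carries
 * an fp16 scale and an fp16 min (32 bits total). */
static double overhead_bits_per_weight(int super_block_size) {
    const int block_size     = 32;        /* weights per inner block */
    const int bits_per_block = 6 + 6;     /* 6-bit scale + 6-bit min */
    const int bits_per_super = 16 + 16;   /* fp16 scale + fp16 min   */
    const int n_blocks       = super_block_size / block_size;
    return (double)(n_blocks * bits_per_block + bits_per_super) / super_block_size;
}

int main(void) {
    printf("super-block 256: %.3f extra bits per weight\n", overhead_bits_per_weight(256)); /* 0.500 */
    printf("super-block  64: %.3f extra bits per weight\n", overhead_bits_per_weight(64));  /* 0.875 */
    return 0;
}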

@SlyEcho
Collaborator

SlyEcho commented Jun 19, 2023

Padding per row should be possible; after all, we store the row length and its size in bytes separately.
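
A rough illustration of the bookkeeping that would imply (illustrative helper, not ggml/llama.cpp code):

#include <stddef.h>

/* Illustrative only: bytes needed to store one row when it is padded up to a
 * whole number of QK_K-sized super-blocks. For Falcon-7B's 4544 columns and
 * QK_K = 256 this rounds 17.75 super-blocks up to 18. */
size_t padded_row_size(int n_cols, int qk_k, size_t bytes_per_super_block) {
    const int n_super = (n_cols + qk_k - 1) / qk_k;   /* round up to full super-blocks */
    return (size_t)n_super * bytes_per_super_block;
}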

@TheBloke
Contributor

TheBloke commented Jun 20, 2023

Just to add, FYI, that I just learned of another type of model that's affected: certain Llama models based on OpenAssistant, which have a vocab size of 32016.

Example model exhibiting this: https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b

convert.py:
Writing vocab...
[  1/543] Writing tensor tok_embeddings.weight                  | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  2/543] Writing tensor norm.weight                            | size   6656           | type UnquantizedDataType(name='F32')
[  3/543] Writing tensor output.weight                          | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  4/543] Writing tensor layers.0.attention.wq.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
[  5/543] Writing tensor layers.0.attention.wk.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
...
quantize:
llama.cpp: loading model from /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.fp16.bin
llama.cpp: saving model to /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.q2_K.bin
========================= Tensor sizes 6656 x 32016 are not divisible by 256
This is required to be able to use k-quants for now!
========================================================================================

Out of interest, did something change with regards to this in the last week or two? I ask because 11 days ago I quantised OpenAssistant-SFT-7, which also uses 32016 x 6656, and it quantised fine: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main


@KerfuffleV2
Collaborator

> Out of interest, did something change with regards to this in the last week or two?

A check was added to make sure the tensor sizes are compatible with k-quants (and to fail if they are not). Before that, parts of the tensors might have been corrupted, or GGML might have read/written memory out of bounds.

So even if it might have seemed like it was working, there were probably issues, and unfortunately it couldn't be left in that state.

@TheBloke
Contributor

Ah I see! I guess I should delete those k-quants from the OpenAssistant-based repos then.

Thanks for the details.

@ymcui
Contributor

ymcui commented Jun 26, 2023

Greetings from Chinese-LLaMA-Alpaca project.

I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, while they worked before that PR.

The reason is that the vocabulary size of our models is not divisible by 256; for example, our Chinese Alpaca model has a vocabulary size of 49954. As the k-quants series generally has better performance, it is a real pity that we can no longer use this feature. In particular, for larger models (like 33B or 65B), q3_k or lower quantization methods are exceptionally useful, as q4_0 or q5_0 won't fit into RAM for most people.

Looking forward to some workaround for this in the future.

@ikawrakow
Contributor Author

@ymcui

> I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, while they worked before that PR.

Sorry about that. If the model worked for you before #1921, the solution is to change https://github.com/ggerganov/llama.cpp/blob/447ccbe8c39332fcdd0d98a041b6e2ff6f06219d/llama.cpp#L2510
to

if (nx % QK_K != 0) {

Basically, some people wasted a lot of time trying to figure out why their models weren't working with k-quants. To prevent this while a solution is being worked on, I added this check in #1921. The check is in many cases more restrictive than it needs to be, but I wanted to be certain that nobody wastes their time again.
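
In other words, the guard presumably goes from rejecting a tensor when either dimension is not a multiple of QK_K to rejecting only those whose row length is not (a sketch of the idea, not the exact llama.cpp source):

/* Presumed original guard added in #1921 (rejects a tensor if either
 * dimension is not a multiple of QK_K):
 *
 *     if (nx % QK_K != 0 || ny % QK_K != 0) { ... }
 *
 * Relaxed form: only the row length nx has to be divisible by QK_K, which is
 * enough for models like the ones above, where it is the other dimension
 * (e.g. a vocabulary size of 32016 or 49954) that breaks divisibility. */
if (nx % QK_K != 0) {
    /* warn and fall back to a non-k quantization type for this tensor */
}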
