Make k-quants work with tensor dimensions that are not multiple of 256 #1919
Comments
Same for OpenLLaMA 3B.
Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels. https://github.com/SqueezeAILab/SqueezeLLM Just my 2 cents :-)
My concept is that k-quants are SOTA. The discussion in #1595 and the description of #1684 shed some light on why I think k-quants are SOTA. Perhaps you could elaborate on how you arrived at the conclusion that the SqueezeLLM approach is better?
This might be a crazy idea, but what if the remainder that is not divisible by 256 were handled as a special case? It might need some special logic to handle, but it would only apply to the very last partial "block", and it should be possible to calculate whether that's necessary outside of any loops.
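For illustration, here is a minimal sketch of that idea (the names are made up for this example; QK_K just stands for the 256-wide super-block size): whether a trailing partial block exists, and how big it is, can be computed once, before entering any per-block loop.

```cpp
// Minimal sketch of the idea above: figure out, outside any per-block loop,
// how many full 256-wide super-blocks a row contains and how large the
// trailing partial block would be.
#include <cstdio>

int main() {
    const int QK_K        = 256;
    const int n_cols      = 4544;            // e.g. a Falcon-7B tensor width
    const int n_full      = n_cols / QK_K;   // complete super-blocks (17)
    const int n_remainder = n_cols % QK_K;   // elements left over (192)

    if (n_remainder > 0) {
        // Only this last partial "block" would need the special-case logic.
        printf("%d full super-blocks plus a partial block of %d elements\n",
               n_full, n_remainder);
    } else {
        printf("%d full super-blocks, no special handling needed\n", n_full);
    }
    return 0;
}
```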
I'm on it. The current thinking is that I will add padding such that I have a multiple of 256. When quantizing, the values that are beyond the actual tensor size will be assumed to be zero. When de-quantizing, one needs to take care to not dequantize values beyond the actual tensor size. The same applies to dot products. It is a pretty big change, so it will take some time. The alternative that was proposed somewhere above is to use super-blocks of 64. This will work for Falcon-7B and for OpenLLaMA 3B. It is a potentially smaller change, but using super-blocks of 64 almost defeats the purpose of the super-blocks, which is to save bits by using quantized scales for the blocks inside a super-block. To give a specific example, with a super-block of 256 and …
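A rough sketch of the padding scheme described here, ignoring the actual quantized block format (so this is not the real ggml code): pad each row up to a multiple of 256 with zeros on the quantization side, and never touch values beyond the true row length on the dequantization / dot-product side.

```cpp
// Illustrative only, not the actual ggml implementation: the row is padded to
// a multiple of QK_K with zeros when "quantizing", and only the first n values
// are ever written back when "dequantizing".
#include <cstdio>
#include <vector>

int main() {
    const int QK_K     = 256;
    const int n        = 4544;                            // actual row length
    const int n_padded = ((n + QK_K - 1) / QK_K) * QK_K;  // 4608 = 18 * 256

    // Quantization side: values beyond the actual tensor size are assumed zero.
    std::vector<float> padded(n_padded, 0.0f);
    for (int i = 0; i < n; ++i) {
        padded[i] = (float) i;    // stand-in for the real row data
    }

    // Dequantization / dot-product side: iterate over whole super-blocks, but
    // never read or write values beyond the actual row length.
    std::vector<float> out(n);
    for (int ib = 0; ib < n_padded / QK_K; ++ib) {
        int n_valid = n - ib * QK_K;           // < QK_K only for the last block
        if (n_valid > QK_K) n_valid = QK_K;
        for (int j = 0; j < n_valid; ++j) {
            out[ib * QK_K + j] = padded[ib * QK_K + j];
        }
    }

    printf("row of %d values stored as %d super-blocks (%d padded elements)\n",
           n, n_padded / QK_K, n_padded);
    return 0;
}
```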
Padding per row should be possible; after all, we store the row length and its size in bytes separately.
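To illustrate that point with example numbers (the 144-byte figure is only an illustrative per-super-block size, not necessarily the exact value for any given k-quant type): the element count of a row and its byte size can diverge once padding is involved.

```cpp
// Hypothetical illustration: the logical row length (number of elements) and
// the row's size in bytes are tracked separately, so a padded row can occupy
// a whole number of super-blocks while the element count stays unchanged.
#include <cstdio>

int main() {
    const int    QK_K        = 256;
    const size_t block_bytes = 144;   // example bytes per quantized super-block

    const int    ne0       = 4544;                     // logical row length
    const int    n_blocks  = (ne0 + QK_K - 1) / QK_K;  // 18 super-blocks after padding
    const size_t row_bytes = (size_t) n_blocks * block_bytes;

    printf("ne0 = %d elements, %d super-blocks, %zu bytes per row\n",
           ne0, n_blocks, row_bytes);
    return 0;
}
```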
Just to add, FYI, that I just learned of another type of model that's affected: certain Llama models based on OpenAssistant, which have a vocab size of 32016. Example model exhibiting this: https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b
Out of interest, did something change with regard to this in the last week or two? Because 11 days ago I quantised OpenAssistant-SFT-7, which also uses 32016 x 6656, and it quantised fine: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main
A check was added to make sure the sizes are compatible with k-quants (and to fail if they're not). Before that, parts of the tensors might have been corrupted, or possibly GGML would read/write memory out of bounds. So even if it might have seemed like it was working, there were probably issues, and unfortunately it couldn't be left in that state.
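As a rough idea of what such a guard looks like (a sketch only, not the literal code added in #1921):

```cpp
// Sketch of the kind of compatibility check described above: refuse to
// quantize a tensor with a k-quant type when its row length is not a multiple
// of the super-block size, instead of risking corruption or out-of-bounds
// reads/writes.
#include <stdexcept>
#include <string>

void check_k_quant_compatible(const std::string & name, int n_cols) {
    const int QK_K = 256;
    if (n_cols % QK_K != 0) {
        throw std::runtime_error(
            "tensor '" + name + "' has " + std::to_string(n_cols) +
            " columns, which is not a multiple of " + std::to_string(QK_K) +
            "; k-quants cannot be used");
    }
}
```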
Ah, I see! I guess I should delete those k-quants from the OpenAssistant-based repos then. Thanks for the details.
Greetings from the Chinese-LLaMA-Alpaca project. I would like to report that after PR #1921 was merged, our models can no longer be quantized with the k-quant methods. The reason is that the vocabulary size of our model is not divisible by 256. For example, our Chinese Alpaca model has a vocabulary size of 49954, so the new check rejects it. Looking forward to some workaround for this in the future.
Sorry about that. If the model worked for you before #1921, the solution is to change the check at https://github.com/ggerganov/llama.cpp/blob/447ccbe8c39332fcdd0d98a041b6e2ff6f06219d/llama.cpp#L2510
Basically, some people wasted a lot of time trying to figure out why their models weren't working with k-quants. To prevent this while a solution is being worked on, I added this check in #1921. The check is in many cases more restrictive than it needs to be, but I wanted to be certain that nobody wastes their time like that again.
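One conceivable way to relax the check, sketched purely for illustration (this is not the actual change at the linked line, and the enum and function names here are made up): fall back to a non-k quantization type for the individual tensors whose column count is not a multiple of 256, rather than rejecting the whole model.

```cpp
// Hypothetical workaround sketch: keep k-quants for compatible tensors and
// fall back to a legacy type for tensors whose row length is not a multiple
// of the 256-wide super-block.
#include <cstdio>

enum class QuantType { Q4_0, Q4_K, Q5_K, Q6_K };

QuantType pick_quant_type(QuantType requested, int n_cols) {
    const int QK_K = 256;
    const bool is_k_quant = requested == QuantType::Q4_K ||
                            requested == QuantType::Q5_K ||
                            requested == QuantType::Q6_K;
    if (is_k_quant && n_cols % QK_K != 0) {
        // Fallback keeps the model usable at a small size/quality cost.
        return QuantType::Q4_0;
    }
    return requested;
}

int main() {
    // e.g. an output/embedding tensor with a 49954-entry vocabulary
    const QuantType t = pick_quant_type(QuantType::Q4_K, 49954);
    printf("chosen type: %s\n", t == QuantType::Q4_0 ? "Q4_0 (fallback)" : "k-quant");
    return 0;
}
```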
As discussed in #1602, k-quants do not work for the Falcon-7B model. This is due to the fact that the number of columns in many tensors (4544) is not divisible by 256, which is the super-block size of the k-quants. It would be useful if k-quants could be adapted to work in such cases.
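A quick check of the divisibility constraint for the dimensions mentioned in this thread (4544 for Falcon-7B, the 32016 and 49954 vocabulary sizes, and 6656 as an example of a dimension that does divide evenly):

```cpp
// Divisibility check for the tensor dimensions discussed in this issue.
#include <cstdio>

int main() {
    const int QK_K   = 256;
    const int dims[] = { 4544, 32016, 49954, 6656 };
    for (int d : dims) {
        printf("%5d %% %d = %3d -> %s\n", d, QK_K, d % QK_K,
               d % QK_K == 0 ? "compatible with k-quants" : "not a multiple of 256");
    }
    return 0;
}
```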