
Make k-quants work with tensor dimensions that are not multiple of 256 #1919

Closed
ikawrakow opened this issue Jun 18, 2023 · 11 comments · Fixed by #2001
Labels
enhancement (New feature or request), model (Model specific)

Comments

@ikawrakow
Contributor

As discussed in #1602, k-quants do not work for the Falcon-7B model. This is because the number of columns in many tensors (4544) is not divisible by 256, the super-block size of the k-quants (4544 = 17 * 256 + 192).

It would be useful if k-quants could be adapted to work in such cases.

@SlyEcho
Collaborator

SlyEcho commented Jun 18, 2023

Same for OpenLLaMA 3B.

@debackerl
Copy link

Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.

https://github.com/SqueezeAILab/SqueezeLLM

Just my 2 cents :-)

@ikawrakow
Contributor Author

> Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.
>
> https://github.com/SqueezeAILab/SqueezeLLM
>
> Just my 2 cents :-)

My view is that k-quants are SOTA. The discussion in #1595 and the description of #1684 shed some light on why I think so. Perhaps you could elaborate on how you arrived at the conclusion that the SqueezeLLM approach is better?

@KerfuffleV2
Collaborator

This might be a crazy idea, but what if the remainder not divisible by QK_K was just left as f32 or f16 and not quantized at all?

It might need some special logic to handle, but it would only apply to the very last partial "block", and whether that's necessary could be calculated once, outside of any loops.
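
A rough sketch of that idea (hypothetical helper names, not llama.cpp code): compute the split once per row, quantize the full super-blocks, and copy the tail as plain floats.

/* Hypothetical stand-in for a k-quant row quantizer. */
void quantize_k_blocks(const float * src, void * dst, int n);

/* Sketch only: quantize the part of a row covered by full super-blocks and
 * keep the remaining tail unquantized. */
void quantize_row_with_tail(const float * src, void * dst_quant,
                            float * dst_tail, int n, int qk_k) {
    const int n_full = (n / qk_k) * qk_k;   /* elements covered by full super-blocks */
    const int n_tail = n - n_full;          /* remainder, stored as plain f32        */

    quantize_k_blocks(src, dst_quant, n_full);
    for (int i = 0; i < n_tail; ++i) {
        dst_tail[i] = src[n_full + i];
    }
}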

@ikawrakow
Contributor Author

I'm on it. The current thinking is that I will add padding such that I have a multiple of 256. When quantizing, the values that are beyond the actual tensor size will be assumed to be zero. When de-quantizing, one needs to take care to not dequantize values beyond the actual tensor size. Same applies to dot products.

It is a pretty big change, so it will take some time.
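
A minimal sketch of the zero-padding part of that plan (hypothetical helper names, assuming QK_K <= 256; not llama.cpp code):

/* Hypothetical stand-in for a k-quant super-block quantizer. */
void quantize_super_block(const float * src, void * dst);

/* Sketch only: quantize the last, partial super-block of a row by treating
 * the values beyond the actual tensor size as zero. */
void quantize_padded_super_block(const float * src, void * dst, int n_remaining, int qk_k) {
    float tmp[256];   /* assumes qk_k <= 256 */
    for (int i = 0; i < qk_k; ++i) {
        tmp[i] = i < n_remaining ? src[i] : 0.0f;   /* zero-pad past the tensor end */
    }
    quantize_super_block(tmp, dst);
}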

The alternative that was proposed somewhere above is to use super-blocks of 64. This will work for Falcon-7B and for OpenLLaMA 3B. It is a potentially smaller change, but using super-blocks of 64 almost defeats the purpose of the super-blocks, which is to save bits by using quantized scales for the blocks inside a super-block.

To give a specific example: with a super-block of 256 and Q4_K, we have 8 blocks of 32, each having a 6-bit scale and a 6-bit min, so that's 8 * 12 = 96 bits. We then have the fp16 scale and min of the super-block, which is another 32 bits, for a total of 128 bits per super-block, or 0.5 bits of extra data per weight. For a super-block size of 64 we have 2 * 12 + 32 = 56 bits per super-block, or 0.875 bits per weight. That's almost the same as Q4_1 (1 extra bit per weight), so we might as well add to Q4_1 the rmse + cosine-distance minimization used in Q4_K during quantization and just use a modified version of Q4_1.
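
For reference, here is that overhead arithmetic as a tiny self-contained C program (the constants just mirror the figures quoted above; a sketch, not llama.cpp code):

#include <stdio.h>

/* Metadata overhead per weight for a Q4_K-style layout: each inner block of 32
 * weights carries a 6-bit scale and a 6-bit min, and each super-block carries
 * an fp16 scale and an fp16 min (32 bits total). */
static double overhead_bits_per_weight(int super_block_size) {
    const int block_size     = 32;        /* weights per inner block */
    const int bits_per_block = 6 + 6;     /* 6-bit scale + 6-bit min */
    const int bits_per_super = 16 + 16;   /* fp16 scale + fp16 min   */
    const int n_blocks       = super_block_size / block_size;
    return (double)(n_blocks * bits_per_block + bits_per_super) / super_block_size;
}

int main(void) {
    printf("super-block 256: %.3f extra bits per weight\n", overhead_bits_per_weight(256)); /* 0.500 */
    printf("super-block  64: %.3f extra bits per weight\n", overhead_bits_per_weight(64));  /* 0.875 */
    return 0;
}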

@SlyEcho
Collaborator

SlyEcho commented Jun 19, 2023

Padding per row should be possible; after all, we store the row length and its size in bytes separately.
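
A rough illustration of the bookkeeping that would imply (illustrative helper, not ggml/llama.cpp code):

#include <stddef.h>

/* Illustrative only: bytes needed to store one row when it is padded up to a
 * whole number of QK_K-sized super-blocks. For Falcon-7B's 4544 columns and
 * QK_K = 256 this rounds 17.75 super-blocks up to 18. */
size_t padded_row_size(int n_cols, int qk_k, size_t bytes_per_super_block) {
    const int n_super = (n_cols + qk_k - 1) / qk_k;   /* round up to full super-blocks */
    return (size_t)n_super * bytes_per_super_block;
}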

@TheBloke
Contributor

TheBloke commented Jun 20, 2023

Just to add, FYI, that I just learned of another type of model that's affected: certain Llama models based on OpenAssistant, which have a vocab size of 32016.

Example model exhibiting this: https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b

convert.py:
Writing vocab...
[  1/543] Writing tensor tok_embeddings.weight                  | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  2/543] Writing tensor norm.weight                            | size   6656           | type UnquantizedDataType(name='F32')
[  3/543] Writing tensor output.weight                          | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  4/543] Writing tensor layers.0.attention.wq.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
[  5/543] Writing tensor layers.0.attention.wk.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
...
quantize:
llama.cpp: loading model from /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.fp16.bin
llama.cpp: saving model to /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.q2_K.bin
========================= Tensor sizes 6656 x 32016 are not divisible by 256
This is required to be able to use k-quants for now!
========================================================================================

Out of interest, did something change with regards to this in the last week or two? I ask because 11 days ago I quantised OpenAssistant-SFT-7, which also uses 32016 x 6656, and it quantised fine: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main


@KerfuffleV2
Collaborator

> Out of interest, did something change with regards to this in the last week or two?

A check was added to make sure the tensor sizes are compatible with k-quants (and to fail if they are not). Before that, parts of the tensors might have been corrupted, or GGML might have read/written memory out of bounds.

So even if it might have seemed like it was working, there were probably issues, and unfortunately it couldn't be left in that state.

@TheBloke
Contributor

Ah I see! I guess I should delete those k-quants from the OpenAssistant-based repos then.

Thanks for the details.

@ymcui
Contributor

ymcui commented Jun 26, 2023

Greetings from Chinese-LLaMA-Alpaca project.

I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, while they worked before that PR.

The reason is that the vocabulary size of our models is not divisible by 256; for example, our Chinese Alpaca model has a vocabulary size of 49954. As the k-quants series generally has better performance, it is a real pity that we can no longer use this feature. In particular, for larger models (like 33B or 65B), q3_k or lower quantization methods are exceptionally useful, as q4_0 or q5_0 won't fit into RAM for most people.

Looking forward to some workaround for this in the future.

@ikawrakow
Contributor Author

@ymcui

> I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, while they worked before that PR.

Sorry about that. If the model worked for you before #1921, the solution is to change https://github.com/ggerganov/llama.cpp/blob/447ccbe8c39332fcdd0d98a041b6e2ff6f06219d/llama.cpp#L2510
to

if (nx % QK_K != 0) {

Basically, some people wasted a lot of time trying to figure out why their models weren't working with k-quants. To prevent this while a solution is being worked on, I added this check in #1921. The check is in many cases more restrictive than it needs to be, but I wanted to be certain that nobody wastes their time again.
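
In other words, the guard presumably goes from rejecting a tensor when either dimension is not a multiple of QK_K to rejecting only those whose row length is not (a sketch of the idea, not the exact llama.cpp source):

/* Presumed original guard added in #1921 (rejects a tensor if either
 * dimension is not a multiple of QK_K):
 *
 *     if (nx % QK_K != 0 || ny % QK_K != 0) { ... }
 *
 * Relaxed form: only the row length nx has to be divisible by QK_K, which is
 * enough for models like the ones above, where it is the other dimension
 * (e.g. a vocabulary size of 32016 or 49954) that breaks divisibility. */
if (nx % QK_K != 0) {
    /* warn and fall back to a non-k quantization type for this tensor */
}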
