
Allow quantizing k-quants to fall back when tensor size incompatible #3747


Merged
merged 2 commits into master on Oct 28, 2023

Conversation

KerfuffleV2
Collaborator

Very simple change to allow quantizing with k-quants to fall back on a comparable choice when k-quants aren't compatible with the tensor dimensions. This will allow quantizing models like CausalLM with (mostly) k-quants.

Any reason not to do this? The fallback choices could be adjusted a bit; I went with trying to maintain quality over size.

Size comparison when quantizing CausalLM 14B to q4_0 vs q4_k_s with this patch: 8365107872 vs 8557804192 bytes, or 8.36 GB vs 8.55 GB (7.8 GiB vs 8.0 GiB). So the difference is pretty small, and the k-quants version should be higher quality. The converted model works perfectly, as far as I can tell.

The fallback case looks like:

[   7/ 363]                  blk.0.ffn_up.weight - [ 5120, 13696,     1,     1], type =   q8_0, quantizing to q4_K .. size =    71.05 MB ->    37.62 MB | hist: 
[   8/ 363]                blk.0.ffn_down.weight - [13696,  5120,     1,     1], type =   q8_0, 

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for q5_K - using fallback quantization q5_1
quantizing to q5_1 .. size =    71.05 MB ->    50.16 MB | 
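Roughly, the fallback boils down to something like the sketch below. This is a simplified illustration rather than the exact code in the patch: the q5_K -> q5_1 pair matches the log above and q4_K -> q5_0 matches the tensor counts further down the thread, but the rest of the mapping and the helper name are just illustrative.

#include "ggml.h"   // for ggml_type and the GGML_TYPE_* constants

// Simplified sketch: if the row length isn't a multiple of the 256-element
// k-quant super-block, substitute a legacy quant of comparable quality.
static ggml_type fallback_quant_type(ggml_type new_type, int64_t n_cols) {
    if (n_cols % 256 == 0) {
        return new_type;                            // k-quants work for this tensor
    }
    switch (new_type) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q4_K: return GGML_TYPE_Q5_0; // favor quality over size
        case GGML_TYPE_Q5_K: return GGML_TYPE_Q5_1;
        case GGML_TYPE_Q6_K: return GGML_TYPE_Q8_0;
        default:             return new_type;       // non-k-quant types pass through
    }
}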

@TheBloke
Contributor

I was discussing this with concedo yesterday. The only concern is that if we put out quants that call themselves k-quants, but are mostly not k-quants, it could cause confusion?

I did a quick back-of-an-envelope calculation and thought I saw that 66% of CausalLM 14B couldn't be k-quanted. Did I get that wrong?

If the majority of the model is still k-quant, like your size comparison indicates, then I think this is a great change.

@KerfuffleV2
Collaborator Author

The only concern is that if we put out quants that call themselves k-quants, but are mostly not k-quants, it could cause confusion?

That's certainly a reasonable concern. A compromise might be to fail unless --allow-requantize is enabled (or maybe add a dedicated option). That way people can't accidentally make a GGUF file that's labeled as k-quants but mostly isn't.

Did I get that wrong?

Maybe. The requirement is actually only on the first dimension of the tensor currently, so 5120 x 13696 works but 13696 x 5120 doesn't.
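A minimal illustration of the check (the helper name here is mine, not from the code): ggml quantizes row by row, which is why only the row length matters.

#include "ggml.h"   // for ggml_tensor

// Only ne[0] (the row length) has to be a multiple of the 256-element
// super-block: 5120 % 256 == 0, but 13696 % 256 == 128, so a
// [13696, 5120] tensor like blk.0.ffn_down.weight needs the fallback.
static bool is_k_quant_compatible(const ggml_tensor * t) {
    return t->ne[0] % 256 == 0;
}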

Output from loading the 14B Q4_K_S quantized CausalLM model:

llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_0:   36 tensors
llama_model_loader: - type q5_1:    4 tensors
llama_model_loader: - type q4_K:  237 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    1 tensors

Per layer, there are four 5120x5120, two 5120x13696, and one 13696x5120 tensors. Only the last one can't be converted. (In terms of size, a 5120x13696 or 13696x5120 tensor is worth about 2.68 of the 5120x5120 ones.) Long story short, it's still mostly k-quants.

But you can't necessarily assume other models will be comparable; the proportions depend on how many tensors have an incompatible first dimension.

@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 23, 2023

Using ROCm with a long prompt plus offloading, I get segfaults that go away with -nommq (or with offload disabled). I think -nommq only has an effect with k-quants.

It's possible it's triggering some other bug or edge case for the tensors with the second dimension not divisible by the block size.

edit: Can anyone else reproduce this kind of behavior? It would be really useful to know.

@TheBloke
Contributor

Per layer, there are four 5120x5120, two 5120x13696, and one 13696x5120 tensors. Only the last one can't be converted. (In terms of size, a 5120x13696 or 13696x5120 tensor is worth about 2.68 of the 5120x5120 ones.) Long story short, it's still mostly k-quants.

But you can't necessarily assume other models will be comparable; the proportions depend on how many tensors have an incompatible first dimension.

Ahh! Yes, I just looked at the tensor layout and assumed that any tensor with 13696 would be incompatible.

That's much better then, and I'd certainly be happy releasing "k-quants" made with that option.

Regarding --allow-requantize - personally I feel it's not necessary to make it a requirement. Instead I'd just print a suitable warning indicating that this is not a "full" k-quant, in addition to the message that gets printed when a given tensor is non-256-divisible and is done using a different format.

But I'd also be fine if --allow-requantize were required. I'd either set it on every quant, if there are no negative implications to that, or else have my code detect the 256-divisibility error and automatically re-run with the option set.

Thanks again!

@cebtenzzre
Collaborator

I don't think we should reuse the --allow-requantize option for this, since it is likely that a user would like to make an imperfect k-quant but does not want to accidentally pass a quantized model as input.

@KerfuffleV2 KerfuffleV2 force-pushed the feat-kquants-quantize-fallback branch from f7205bb to 4059df1 on October 27, 2023 13:50
@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 27, 2023

I added a warning for when k-quants required a fallback. It looks like this:

llama_model_quantize_internal: model size  = 14355.96 MB
llama_model_quantize_internal: quant size  =  6340.62 MB
llama_model_quantize_internal: hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.103 0.101 0.095 0.083 0.067 0.051 0.037 0.025 0.040 
llama_model_quantize_internal: WARNING: 40 of 282 tensor(s) incompatible with k-quants and required fallback quantization

Also cleaned up the k-quants quantizing state handling a bit to use a struct instead of individually passing stuff like the number of attention layers. This should also make other stateful quantization changes a lot easier in the future.
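For reference, the struct is roughly along these lines (field names here are illustrative rather than an exact copy of the commit):

// Illustrative sketch: bundle the per-model totals and running counters that
// used to be threaded through the quantization functions as loose arguments.
struct quantize_state {
    int n_attention_wv    = 0;  // total attention V tensors in the model
    int i_attention_wv    = 0;  // how many have been quantized so far
    int n_feed_forward_w2 = 0;  // total ffn_down tensors
    int i_feed_forward_w2 = 0;
    int n_k_quantized     = 0;  // tensors that got a real k-quant type
    int n_fallback        = 0;  // tensors that needed fallback quantization
};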

I'd either set it on every quant, if there are no negative implications to that.

I think the current plan is not to require that, but there's no harm in passing it unnecessarily if you want to. It just prevents the error you get when quantize tries to re-quantize a tensor that's already quantized. If you're quantizing f16 or f32 models then you'll never run into that condition.
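For example (file names here are just placeholders), passing the flag on a normal f16 -> Q4_K_S conversion is harmless:

# Hypothetical file names; --allow-requantize is a no-op when the input is f16.
./quantize --allow-requantize causallm-14b-f16.gguf causallm-14b-q4_k_s.gguf Q4_K_S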

@TheBloke
Contributor

That's awesome, thanks!

@ggerganov ggerganov merged commit bd6d9e2 into ggml-org:master Oct 28, 2023
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Oct 28, 2023
Allow quantizing k-quants to fall back when tensor size incompatible (ggml-org#3747)

* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
@KerfuffleV2 KerfuffleV2 deleted the feat-kquants-quantize-fallback branch November 17, 2023 03:11
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Allow quantizing k-quants to fall back when tensor size incompatible (ggml-org#3747)

* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 30, 2023