
Allow quantizing k-quants to fall back when tensor size incompatible #3747


Merged
merged 2 commits into master on Oct 28, 2023

Conversation

KerfuffleV2
Collaborator

Very simple change to allow quantizing with k-quants to fall back on a comparable choice when k-quants aren't compatible with the tensor dimensions. This will allow quantizing models like CausalLM with (mostly) k-quants.

Any reason not to do this? The fallback choices could be adjusted a bit; I went with trying to maintain quality over size.

Size comparison when quantizing CausalLM 14B to q4_0 vs q4_k_s with this patch: 8365107872 vs 8557804192 bytes, or 8.36 GB vs 8.55 GB (7.8 GiB vs 8.0 GiB). So the difference is pretty small, and the k-quants version should be higher quality. The converted model works perfectly, as far as I can tell.

The fallback case looks like:

[   7/ 363]                  blk.0.ffn_up.weight - [ 5120, 13696,     1,     1], type =   q8_0, quantizing to q4_K .. size =    71.05 MB ->    37.62 MB | hist: 
[   8/ 363]                blk.0.ffn_down.weight - [13696,  5120,     1,     1], type =   q8_0, 

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for q5_K - using fallback quantization q5_1
quantizing to q5_1 .. size =    71.05 MB ->    50.16 MB | 
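Roughly, the fallback boils down to something like the sketch below. This is a simplified illustration rather than the exact code in the patch: the q5_K -> q5_1 pair matches the log above and q4_K -> q5_0 matches the tensor counts further down the thread, but the rest of the mapping and the helper name are just illustrative.

#include "ggml.h"   // for ggml_type and the GGML_TYPE_* constants

// Simplified sketch: if the row length isn't a multiple of the 256-element
// k-quant super-block, substitute a legacy quant of comparable quality.
static ggml_type fallback_quant_type(ggml_type new_type, int64_t n_cols) {
    if (n_cols % 256 == 0) {
        return new_type;                            // k-quants work for this tensor
    }
    switch (new_type) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q4_K: return GGML_TYPE_Q5_0; // favor quality over size
        case GGML_TYPE_Q5_K: return GGML_TYPE_Q5_1;
        case GGML_TYPE_Q6_K: return GGML_TYPE_Q8_0;
        default:             return new_type;       // non-k-quant types pass through
    }
}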

@TheBloke
Contributor

I was discussing this with concedo yesterday. The only concern is that if we put out quants that call themselves k-quants, but are mostly not k-quants, it could cause confusion?

I did a quick back-of-an-envelope calculation and thought I saw that 66% of CausalLM 14B couldn't be k-quanted. Did I get that wrong?

If the majority of the model is still k-quant, like your size comparison indicates, then I think this is a great change.

@KerfuffleV2
Collaborator Author

The only concern is that if we put out quants that call themselves k-quants, but are mostly not k-quants, it could cause confusion?

That's certainly a reasonable concern. A compromise might be to fail unless --allow-requantize is enabled (or maybe add a dedicated option). That way people can't accidentally make a GGUF file that's labeled as k-quants but mostly isn't.

Did I get that wrong?

Maybe. The requirement is actually only on the first dimension of the tensor currently, so 5120 x 13696 works but 13696 x 5120 doesn't.
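A minimal illustration of the check (the helper name here is mine, not from the code): ggml quantizes row by row, which is why only the row length matters.

#include "ggml.h"   // for ggml_tensor

// Only ne[0] (the row length) has to be a multiple of the 256-element
// super-block: 5120 % 256 == 0, but 13696 % 256 == 128, so a
// [13696, 5120] tensor like blk.0.ffn_down.weight needs the fallback.
static bool is_k_quant_compatible(const ggml_tensor * t) {
    return t->ne[0] % 256 == 0;
}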

Output from loading the 14B Q4_K_S quantized CausalLM model:

llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_0:   36 tensors
llama_model_loader: - type q5_1:    4 tensors
llama_model_loader: - type q4_K:  237 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    1 tensors

Per layer, there are four 5120x5120, two 5120x13696, and one 13696x5120 tensors. Only the last one can't be converted. (In terms of size, a 5120x13696 or 13696x5120 tensor is worth about 2.68 of the 5120x5120 ones.) Long story short, it's still mostly k-quants.

But you can't necessarily assume other models will be comparable; the proportions depend on how many tensors have an incompatible first dimension.

@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 23, 2023

Using ROCm with a long prompt plus offloading, I get segfaults that go away with -nommq (or with offload disabled). I think -nommq only has an effect with k-quants.

It's possible it's triggering some other bug or edge case for the tensors with the second dimension not divisible by the block size.

edit: Can anyone else reproduce this kind of behavior? It would be really useful to know.

@TheBloke
Contributor

Per layer, there are four 5120x5120, two 5120x13696, and one 13696x5120 tensors. Only the last one can't be converted. (In terms of size, a 5120x13696 or 13696x5120 tensor is worth about 2.68 of the 5120x5120 ones.) Long story short, it's still mostly k-quants.

But you can't necessarily assume other models will be comparable; the proportions depend on how many tensors have an incompatible first dimension.

Ahh! Yes, I just looked at the tensor layout and assumed that any tensor with 13696 would be incompatible.

That's much better then, and I'd certainly be happy releasing "k-quants" made with that option.

Regarding --allow-requantize - personally I feel it's not necessary to make it a requirement. Instead I'd just print a suitable warning indicating that this is not a "full" k-quant, in addition to the message that gets printed when a given tensor is non-256-divisible and is done using a different format.

But I'd also be fine if --allow-requantize were required. I'd either set it on every quant, if there are no negative implications to that, or else have my code detect the 256-divisibility error and automatically re-run with the option set.

Thanks again!

@cebtenzzre
Collaborator

I don't think we should reuse the --allow-requantize option for this, since it is likely that a user would like to make an imperfect k-quant but does not want to accidentally pass a quantized model as input.

@KerfuffleV2 KerfuffleV2 force-pushed the feat-kquants-quantize-fallback branch from f7205bb to 4059df1 on October 27, 2023 13:50
@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 27, 2023

I added a warning for when k-quants required a fallback. It looks like this:

llama_model_quantize_internal: model size  = 14355.96 MB
llama_model_quantize_internal: quant size  =  6340.62 MB
llama_model_quantize_internal: hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.103 0.101 0.095 0.083 0.067 0.051 0.037 0.025 0.040 
llama_model_quantize_internal: WARNING: 40 of 282 tensor(s) incompatible with k-quants and required fallback quantization

Also cleaned up the k-quants quantizing state handling a bit to use a struct instead of individually passing stuff like the number of attention layers. This should also make other stateful quantization changes a lot easier in the future.
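For reference, the struct is roughly along these lines (field names here are illustrative rather than an exact copy of the commit):

// Illustrative sketch: bundle the per-model totals and running counters that
// used to be threaded through the quantization functions as loose arguments.
struct quantize_state {
    int n_attention_wv    = 0;  // total attention V tensors in the model
    int i_attention_wv    = 0;  // how many have been quantized so far
    int n_feed_forward_w2 = 0;  // total ffn_down tensors
    int i_feed_forward_w2 = 0;
    int n_k_quantized     = 0;  // tensors that got a real k-quant type
    int n_fallback        = 0;  // tensors that needed fallback quantization
};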

I'd either set it on every quant, if there are no negative implications to that.

I think the current plan is not to require that, but there's no harm in passing it unnecessarily if you want to. It just prevents the error you get when quantize tries to re-quantize a tensor that's already quantized. If you're quantizing f16 or f32 models then you'll never run into that condition.
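For example (file names here are just placeholders), passing the flag on a normal f16 -> Q4_K_S conversion is harmless:

# Hypothetical file names; --allow-requantize is a no-op when the input is f16.
./quantize --allow-requantize causallm-14b-f16.gguf causallm-14b-q4_k_s.gguf Q4_K_S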

@TheBloke
Contributor

That's awesome, thanks!

@ggerganov ggerganov merged commit bd6d9e2 into ggml-org:master Oct 28, 2023
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Oct 28, 2023
Allow quantizing k-quants to fall back when tensor size incompatible (ggml-org#3747)

* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
@KerfuffleV2 KerfuffleV2 deleted the feat-kquants-quantize-fallback branch November 17, 2023 03:11
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Allow quantizing k-quants to fall back when tensor size incompatible (ggml-org#3747)

* Allow quantizing k-quants to fall back when tensor size incompatible

* quantizing: Add warning when tensors were incompatible with k-quants

Clean up k-quants state passing a bit
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 30, 2023