Allow quantizing k-quants to fall back when tensor size incompatible #3747
Conversation
I was discussing this with concedo yesterday. The only concern is that if we put out quants that call themselves k-quants but are mostly not k-quants, it could cause confusion. I did a quick back-of-the-envelope calculation and thought I saw that 66% of CausalLM 14B couldn't be k-quanted. Did I get that wrong? If the majority of the model is still k-quant, as your size comparison indicates, then I think this is a great change.
That's certainly a reasonable concern. A compromise might be to fail unless --allow-requantize is specified.
Maybe. The requirement is actually only on the first dimension of the tensor currently. Going by the output from loading the 14B:
Per layer, there are four 5120x5120, two 5120x13696 and one 13696x5120 tensors. Only the last one can't be converted. (In terms of size, a 5120x13696 or 13696x5120 tensor is worth about 2.68 of the 5120x5120 ones.) Long story short, it's still mostly k-quants. But you can't necessarily assume other models will be comparable; the proportion will depend on how many tensors have an incompatible first dimension.
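For reference, a minimal sketch of the kind of divisibility check being discussed, assuming the k-quant super-block size QK_K = 256 (names here are illustrative, not the exact llama.cpp code):

```cpp
#include <cstdint>
#include <cstdio>

// k-quants pack values into super-blocks of QK_K elements, so the first
// tensor dimension must be a multiple of QK_K to use them directly.
static const int64_t QK_K = 256;

static bool first_dim_compatible(int64_t ne0) {
    return ne0 % QK_K == 0;
}

int main() {
    // The CausalLM 14B per-layer first dimensions mentioned above.
    printf("5120  -> %s\n", first_dim_compatible(5120)  ? "ok" : "needs fallback"); // 5120 = 20 * 256
    printf("13696 -> %s\n", first_dim_compatible(13696) ? "ok" : "needs fallback"); // 13696 % 256 = 128
    return 0;
}
```

This matches the shapes above: the 5120x5120 and 5120x13696 tensors pass the first-dimension check, while the 13696x5120 tensor does not.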
Using ROCM and a long prompt + offloading, I get segfaults that go away with [...]. It's possible it's triggering some other bug or edge case for the tensors with the second dimension not divisible by the block size. Edit: Can anyone else reproduce this kind of behavior? It would be really useful to know.
Ahh! Yes, I just looked at the tensor layout and assumed that any tensor with 13696 would be incompatible. That's much better, then, and I'd certainly be happy releasing "k-quants" made with that option. Regarding --allow-requantize: personally, I feel it's not necessary to make it a requirement. Instead I'd just print a suitable warning indicating that this is not a "full" k-quant, in addition to the message that gets printed when a given tensor is non-256-divisible and is done using a different format. But I'd also be fine if --allow-requantize was required. I'd either set it on every quant, if there are no negative implications to that, or else have my code detect the error from the 256-divisible check and automatically re-run with the option set. Thanks again!
I don't think we should reuse the --allow-requantize option for this, since it is likely that a user would like to make an imperfect k-quant but does not want to accidentally pass an already-quantized model as input.
Clean up k-quants state passing a bit
f7205bb to 4059df1
I added a warning for when k-quants required a fallback.
Also cleaned up the k-quants quantizing state handling a bit to use a struct instead of individually passing stuff like the number of attention layers. This should also make other stateful quantization changes a lot easier in the future.
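A rough illustration of that cleanup idea, with hypothetical field names rather than the exact llama.cpp definitions: the per-model counters live in one struct that gets passed around instead of being threaded through as separate arguments.

```cpp
// Hypothetical sketch: group the quantization bookkeeping into one struct
// instead of passing each field (layer counts, fallback counters, ...)
// as a separate function argument.
struct quantize_state {
    int n_attention_wv    = 0; // number of attention wv tensors in the model
    int n_feed_forward_w2 = 0; // number of feed-forward w2 tensors in the model
    int i_attention_wv    = 0; // how many have been processed so far
    int i_feed_forward_w2 = 0;
    int n_k_quantized     = 0; // tensors quantized with a k-quant type
    int n_fallback        = 0; // tensors that needed a non-k-quant fallback
};

// Callers then pass one object rather than a long argument list, e.g.:
// ggml_type pick_type(const quantize_state & qs, const ggml_tensor * t, ggml_type requested);
```

Keeping the counters in one place also makes it easy to print a summary warning at the end (e.g. how many tensors fell back) without adding more parameters.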
I think the current plan is not to require that, but there's no harm in passing it unnecessarily if you want to. It just prevents the error you get when quantize tries to re-quantize a tensor that's already quantized. If you're quantizing [...]
That's awesome, thanks!
…patible (ggml-org#3747)
* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
  Clean up k-quants state passing a bit
Very simple change to allow quantizing with k-quants to fall back on a comparable choice when a k-quant isn't compatible with the tensor dimensions. This will allow quantizing models like CausalLM with (mostly) k-quants.
Any reason not to do this? The fallback choices could be adjusted a bit; I went with trying to maintain quality over size.
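As a rough sketch of what the fallback amounts to: when a tensor's first dimension isn't a multiple of QK_K, substitute a legacy quant of comparable or better quality instead of erroring out. The specific type pairs below are illustrative of the quality-over-size preference, not necessarily the exact choices in the patch.

```cpp
#include "ggml.h"

// Illustrative fallback mapping for tensors that can't use k-quants.
// The exact pairs chosen in the patch may differ.
static ggml_type fallback_type(ggml_type requested) {
    switch (requested) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K: return GGML_TYPE_Q4_0;
        case GGML_TYPE_Q4_K: return GGML_TYPE_Q5_0;
        case GGML_TYPE_Q5_K: return GGML_TYPE_Q5_1;
        case GGML_TYPE_Q6_K: return GGML_TYPE_Q8_0;
        default:             return requested; // non-k-quant types are unaffected
    }
}
```

Since only the few incompatible tensors take the fallback path, erring toward a slightly larger but higher-quality legacy type costs very little overall, as the size comparison below shows.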
Size comparison when quantizing CausalLM 14B to q4_0 vs q4_K_S with this patch: 8365107872 vs 8557804192 bytes, or 8.36GB vs 8.55GB (7.8GiB vs 8.0GiB). So the difference is pretty small, and the k-quants version should be higher quality. The converted model works perfectly, as far as I can tell.