llama.cpp

LLM inference in C/C++, with Nexa AI's support for audio language models and Swift bindings.

Last updated on Mar 4th, 2025.

This repo was cloned from llama.cpp at commit 06c2b1561d8b882bc018554591f8c35eb04ad30e and is compatible with llama-cpp-python at commit 710e19a81284e5af0d5db93cef7a9063b3e8534f.
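
If you want to use this fork through llama-cpp-python, one possible workflow is sketched below. This is not an officially documented procedure: it assumes llama-cpp-python vendors llama.cpp under vendor/llama.cpp (as upstream does) and that this fork's default branch is master; adjust as needed.

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
git checkout 710e19a81284e5af0d5db93cef7a9063b3e8534f
git submodule update --init --recursive

# Swap the vendored llama.cpp for this fork (branch name is an assumption).
cd vendor/llama.cpp
git remote add nexa https://github.com/NexaAI/llama.cpp
git fetch nexa
git checkout nexa/master
cd ../..

# llama-cpp-python forwards CMake flags via the CMAKE_ARGS environment variable,
# so the group-size flag described below can be passed through the pip build.
CMAKE_ARGS="-DQK4_0=128" pip install .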

Customize the quantization group size at compile time (CPU inference only)

The only difference from upstream is the -DQK4_0 flag passed to cmake, which sets the Q4_0 quantization group size.

cmake -B build_cpu_g128 -DQK4_0=128
cmake --build build_cpu_g128
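
The same pattern should apply to other group sizes. This is an assumption: only 128 is shown in this README, and the value presumably needs to be a multiple of the upstream default of 32.

cmake -B build_cpu_g64 -DQK4_0=64
cmake --build build_cpu_g64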

To quantize the model with the customized group size, run

./build_cpu_g128/bin/llama-quantize <model_path.gguf> <quantization_type>
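
For example, to produce a group-size-128 Q4_0 model (the file names here are hypothetical; llama-quantize also accepts an optional output path before the quantization type):

./build_cpu_g128/bin/llama-quantize models/llama-8b-f16.gguf models/llama-8b-q4_0-g128.gguf Q4_0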

To run inference with the quantized model, use

./build_cpu_g128/bin/llama-cli -m <quantized_model_path.gguf>
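
For example, with a short prompt (hypothetical model path; -p and -n are the standard llama-cli flags for the prompt and the number of tokens to generate):

./build_cpu_g128/bin/llama-cli -m models/llama-8b-q4_0-g128.gguf -p "Write a haiku about quantization." -n 64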

Note:

Make sure the model you run was quantized with the same group size you compiled with; otherwise you'll get a runtime error when loading the model.
