[Enhancement] Add support for Metal inference #216
Sure, I can try that, but I will need someone to test it for me as I don't have a Mac to test on myself. I have added the flags to the Makefile. Can you please try building with LLAMA_METAL=1?
Does kobold.cpp use Metal by default when you compile it with LLAMA_METAL=1? In llama.cpp you need to start main with "-ngl 1" to use it.
You should use the koboldcpp command-line arg --gpulayers, which sets the same value.
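For anyone following along, a rough sketch of the workflow being discussed (the model path is a placeholder, and the exact invocation may differ between koboldcpp versions):

```sh
# Build with Metal support enabled (Apple Silicon)
make LLAMA_METAL=1

# Run koboldcpp with layers offloaded to the GPU; --gpulayers plays the same
# role here as llama.cpp's -ngl flag
python koboldcpp.py --model /path/to/model-q4_0.bin --gpulayers 1
```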
I've got an M2 Max MacBook that I've just tried it on. Ran
Not sure if I'm missing something or what. This is with the latest commit to the branch.
I've made a new commit to that branch to add the flags; can you see if it works now?
That fixed the build problem, thanks. Will need to evaluate the performance a bit later.
Tried running it with Metal enabled (on a 65B model) using this command, but it throws an error during initialization:
Unfortunately it looks a bit cryptic to me; no idea what it means.
This error happens when the code in ggml-metal.m tries to load 'ggml-metal.metal' at lines 102 and 115; even though the ggml-metal.metal file exists in the same directory, it throws an error. I have no idea how to deal with this. Please suggest any ideas to solve this problem.
I'm not familiar enough with Objective-C to tell why the NSBundle pathForResource works in the llama.cpp repo but not here; however, if you want a workaround you could change those lines to use the relative path directly:
I imagine this will break if you try to embed kobold in an NSBundle (like a .app folder), but it works for the normal local invocation.
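To illustrate, a minimal sketch of that kind of relative-path workaround, assuming ggml-metal.m loads the shader source with the upstream pathForResource / stringWithContentsOfFile pattern (variable names and details here are illustrative, not the exact koboldcpp code):

```objc
// ggml-metal.m (illustrative) - the bundle-based lookup that fails:
//   NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];

// Workaround: load ggml-metal.metal via a relative path next to the binary instead.
NSError  * error = nil;
NSString * path  = @"./ggml-metal.metal"; // relative to the current working directory
NSString * src   = [NSString stringWithContentsOfFile:path
                                             encoding:NSUTF8StringEncoding
                                                error:&error];
if (src == nil) {
    fprintf(stderr, "%s: error loading %s: %s\n",
            __func__, [path UTF8String], [[error description] UTF8String]);
}
```

Because this resolves the shader relative to the working directory, it works for a plain local run but would break inside an .app bundle, as noted above.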
Hmm, I have not tried the Metal implementation at all, nor have I modified any of the files. If you have recommendations to change them do let me know or send a PR, as I am not familiar with it at all.
Finally, I have made pull request #221, which uses a patch file to change the path of the ggml-metal.metal file in ggml-metal.m. The clip below is an example with LLaMA 65B GGML q4_0: koboldcpp.metal.mp4
I applied the fix manually and can confirm that it works. Getting ~3.5 T/s on the 65B model with my M2 Max. Pretty damn good, and fast enough for chat. Prompt processing is the biggest bottleneck now; I hope that can be sped up with Metal too at some point.
The patch from the PR has been merged directly into the Metal file itself, along with the further Metal enhancements from upstream. Can you confirm if everything works now?
@LostRuins I just tried the latest commit (d28ed99). Builds okay, but crashes on run with the following:
Looks like ggml-metal.metal in koboldcpp has some bugs. Copying from llama.cpp's latest version will solve this. I guess the bugs in koboldcpp will disappear soon, once LostRuins merges the latest files from llama.cpp.
@LostRuins ggml-metal.metal has some bugs, as the comment above describes, and the most recent version from llama.cpp solves them. Please merge the Apple Metal related files from llama.cpp b33dee2.
Okay, I've merged the upstream Metal changes - does it work now?
I can also confirm that it works now, and it does seem to be a bit faster from the new changes (around 4 T/s now, up from 3.5 T/s, though with limited testing it could just be variance).
Ok, does not work for me:
However, please note that I'm pretty new to all this AI stuff, so it might very well be an error on my side^^
EDIT: Seems like I've been using an incompatible model again – will try with a compatible model ASAP 😅
EDIT 2: Works with q4_0 quantized models here too 🥳
I get:
When loading a 30B model on my 32GB M1 Max. Shouldn't my hardware be able to support that large of a model?
Maybe there is some overhead? Unfortunately I know nothing about the Metal implementation, so you should ask the devs on the llama.cpp issues page.
Yeah, good idea. Thanks! I'll reply here if there are any applicable updates on the matter.
@helgur My M1 Max has 64GB of unified memory, and of that 64GB of RAM, up to 48GB can be used as VRAM. Looks like a 32GB machine's VRAM limit is up to 16GB.
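For anyone curious what limit Metal actually reports on their machine, a small standalone Objective-C sketch (not part of koboldcpp; assumes macOS with the Metal framework available, and the file name is just a placeholder):

```objc
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Prints the working-set limit Metal reports for the default GPU, i.e. roughly
// how much of the unified memory can be used as "VRAM".
// Build with: clang vram_check.m -framework Foundation -framework Metal -framework CoreGraphics -o vram_check
int main(void) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        if (device == nil) {
            fprintf(stderr, "No Metal device found\n");
            return 1;
        }
        printf("recommendedMaxWorkingSetSize: %.1f GB\n",
               device.recommendedMaxWorkingSetSize / 1e9);
    }
    return 0;
}
```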
@beebopkim ah, that explains it :( Thanks!
Hey all, I just saw an upstream update that applied a similar fix for Metal: ggml-org#1782, which I merged. Can you all see if this fix works too, or does it break?
Hey, I just pulled fb67506 and Metal works here, very cool 👍 Seeing around 9 T/s generation on a 16GB M1 Pro using a 13B Q4_0 model.
With commit fb67506, everything is okay except f16. llama.cpp now supports Metal MPS inference for f16, q4_0, q4_1, q2_k, q4_k (q4_k_m), q4_k_s, and q6_k, but it looks like f16 does not work for Metal inference with commit fb67506. I tested every known quantized format for Metal inference and only f16 is not using Metal (I saw f16 working in older commits). But that is okay, as it's not koboldcpp's problem - llama.cpp has exactly the same problem. Though f16 skips Metal inference, it falls back to the CPU and just works more slowly.
I confirmed that GGML F16 is not accelerated with Metal in koboldcpp. With the most recent master commit of llama.cpp, GGML F16 is using Metal inference:
LLaMA-13B-GGML-F16-llama_cpp-master-Metal.mp4
But with commit 120851d of koboldcpp, Metal inference is not working:
LLaMA-13B-GGML-F16-koboldcpp-concedo_experimental-CPU.mp4
The model weight used in this test is LLaMA 13B GGML F16.
To confirm, this is a known issue with upstream and not from KoboldCpp? If so, let me know when upstream has fixed it, and I will pull the fix.
@LostRuins llama.cpp now supports F16 Metal MPS acceleration, but koboldcpp does not. The others - q4_0, q4_1, and q2_k ~ q6_k - look okay. I think this means the problem is on koboldcpp's side.
That is strange, as I did not modify any of the Metal files - so the behavior should be identical to upstream. Unfortunately I cannot debug it myself as I have no compatible Mac device. If you figure out what to change, let me know and I can update it. I think generally nobody really uses the f16 format though.
@LostRuins What I have spotted is that the source file containing "Warning: Your model may be an OUTDATED format" does not exist in llama.cpp; I found that gpttype_adapter.cpp from koboldcpp has this string. So I guess I need to investigate gpttype_adapter.cpp soon.
I think I should close this discussion because Apple Metal inference is now well supported in koboldcpp. I found that I need to change expose.cpp and gpttype_adapter.cpp, and maybe the header files, but the code is very complex, and nowadays I am too busy to analyze and change it soon. I will make a new pull request for supporting Apple Metal inference of F16. Thanks for changing the code to support most Apple Metal inference. 👍
You're welcome. Thanks for helping to debug Metal, since I cannot run it myself.
Please enable Metal inference for GGML weight models.
llama.cpp can now generate text using Metal inference on Apple Silicon computers. It is very good news for M1/M2 users; I can run the LLaMA 65B GGML q4_0 model on my M1 Max at a speed of 4 ~ 5 tokens/s. It is awesome and really fantastic!
The koboldcpp repository already has the related source files from llama.cpp, like ggml-metal.h, ggml-metal.m, and ggml-metal.metal. So please make them available during inference for text generation. It would be a very special present for Apple Silicon users.