[Enhancement] Add support for Metal inference #216
Sure, I can try that, but I will need someone to test it for me as I don't have a Mac to test on myself. I have added the flags to the Makefile. Can you please try building with LLAMA_METAL=1?
Does kobold.cpp use Metal by default when you compile it with LLAMA_METAL=1? In llama.cpp you need to start main with "-ngl 1" to use it.
You should use the koboldcpp command-line arg --gpulayers, which sets the same value.
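For anyone following along, a rough sketch of the workflow being discussed (the model path is a placeholder, and the exact invocation may differ between koboldcpp versions):

```sh
# Build with Metal support enabled (Apple Silicon)
make LLAMA_METAL=1

# Run koboldcpp with layers offloaded to the GPU; --gpulayers plays the same
# role here as llama.cpp's -ngl flag
python koboldcpp.py --model /path/to/model-q4_0.bin --gpulayers 1
```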
I've got an M2 Max MacBook that I've just tried it on. Ran
Not sure if I'm missing something or what. This is with the latest commit to the branch.
I've made a new commit to that branch to add the flags; can you see if it works now?
That fixed the build problem, thanks. Will need to evaluate the performance a bit later.
Tried running it with Metal enabled (on a 65B model) using this command, but it throws an error during initialization:
Unfortunately it looks a bit cryptic to me; no idea what it means.
This error happens when the code in ggml-metal.m tries to load 'ggml-metal.metal' at lines 102 and 115; even though the ggml-metal.metal file exists in the same directory, it throws an error. I have no idea how to deal with this. Please suggest any ideas to solve this problem.
I'm not familiar enough with Objective-C to tell why the NSBundle pathForResource works in the llama.cpp repo but not here; however, if you want a workaround you could change those lines to use the relative path directly:
I imagine this will break if you try to embed kobold in an NSBundle (like a .app folder), but it works for the normal local invocation.
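To illustrate, a minimal sketch of that kind of relative-path workaround, assuming ggml-metal.m loads the shader source with the upstream pathForResource / stringWithContentsOfFile pattern (variable names and details here are illustrative, not the exact koboldcpp code):

```objc
// ggml-metal.m (illustrative) - the bundle-based lookup that fails:
//   NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];

// Workaround: load ggml-metal.metal via a relative path next to the binary instead.
NSError  * error = nil;
NSString * path  = @"./ggml-metal.metal"; // relative to the current working directory
NSString * src   = [NSString stringWithContentsOfFile:path
                                             encoding:NSUTF8StringEncoding
                                                error:&error];
if (src == nil) {
    fprintf(stderr, "%s: error loading %s: %s\n",
            __func__, [path UTF8String], [[error description] UTF8String]);
}
```

Because this resolves the shader relative to the working directory, it works for a plain local run but would break inside an .app bundle, as noted above.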
Hmm, I have not tried the Metal implementation at all, nor have I modified any of the files. If you have recommendations to change them do let me know or send a PR, as I am not familiar with it at all.
Finally, I have made pull request #221, which uses a patch file to change the path of the ggml-metal.metal file in ggml-metal.m. The clip below is an example with LLaMA 65B GGML q4_0: koboldcpp.metal.mp4
I applied the fix manually and can confirm that it works. Getting ~3.5 T/s on the 65B model with my M2 Max. Pretty damn good, and fast enough for chat. Prompt processing is the biggest bottleneck now; I hope that can be sped up with Metal too at some point.
The patch from the PR has been merged directly into the Metal file itself, along with the further Metal enhancements from upstream. Can you confirm if everything works now?
@LostRuins I just tried the latest commit (d28ed99). Builds okay, but crashes on run with the following:
Looks like ggml-metal.metal in koboldcpp has some bugs. Copying from llama.cpp's latest version will solve this. I guess the bugs in koboldcpp will disappear soon, once LostRuins merges the latest files from llama.cpp.
@LostRuins ggml-metal.metal has some bugs, as the comment above describes, and the most recent version from llama.cpp solves them. Please merge the Apple Metal related files from llama.cpp b33dee2.
Okay, I've merged the upstream Metal changes - does it work now?
I can also confirm that it works now, and it does seem to be a bit faster from the new changes (around 4 T/s now, up from 3.5 T/s, though with limited testing it could just be variance).
Ok, does not work for me:
However, please note that I'm pretty new to all this AI stuff, so it might very well be an error on my side^^
EDIT: Seems like I've been using an incompatible model again – will try with a compatible model ASAP 😅
EDIT 2: Works with q4_0 quantized models here too 🥳
I get:
When loading a 30B model on my 32GB M1 Max. Shouldn't my hardware be able to support that large of a model?
Maybe there is some overhead? Unfortunately I know nothing about the Metal implementation, so you should ask the devs on the llama.cpp issues page.
Yeah, good idea. Thanks! I'll reply here if there are any applicable updates on the matter.
@helgur My M1 Max has 64GB of unified memory, and of that 64GB of RAM, up to 48GB can be used as VRAM. Looks like a 32GB machine's VRAM limit is up to 16GB.
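For anyone curious what limit Metal actually reports on their machine, a small standalone Objective-C sketch (not part of koboldcpp; assumes macOS with the Metal framework available, and the file name is just a placeholder):

```objc
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Prints the working-set limit Metal reports for the default GPU, i.e. roughly
// how much of the unified memory can be used as "VRAM".
// Build with: clang vram_check.m -framework Foundation -framework Metal -framework CoreGraphics -o vram_check
int main(void) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        if (device == nil) {
            fprintf(stderr, "No Metal device found\n");
            return 1;
        }
        printf("recommendedMaxWorkingSetSize: %.1f GB\n",
               device.recommendedMaxWorkingSetSize / 1e9);
    }
    return 0;
}
```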
@beebopkim ah, that explains it :( Thanks!
Hey all, I just saw an upstream update that applied a similar fix for Metal: ggml-org#1782, which I merged. Can you all see if this fix works too, or does it break?
Hey, I just pulled fb67506 and Metal works here, very cool 👍 Seeing around 9 T/s generation on a 16GB M1 Pro using a 13B Q4_0 model.
With commit fb67506, everything is okay except f16. llama.cpp now supports Metal MPS inference for f16, q4_0, q4_1, q2_k, q4_k (q4_k_m), q4_k_s, and q6_k, but it looks like f16 does not work for Metal inference with commit fb67506. I tested every known quantized format for Metal inference and only f16 is not using Metal (I saw f16 working in older commits). But that is okay, as it's not koboldcpp's problem - llama.cpp has exactly the same problem. Though f16 skips Metal inference, it falls back to the CPU and just works more slowly.
I confirmed that GGML F16 is not accelerated with Metal in koboldcpp. With the most recent master commit of llama.cpp, GGML F16 is using Metal inference:
LLaMA-13B-GGML-F16-llama_cpp-master-Metal.mp4
But with commit 120851d of koboldcpp, Metal inference is not working:
LLaMA-13B-GGML-F16-koboldcpp-concedo_experimental-CPU.mp4
The model weight used in this test is LLaMA 13B GGML F16.
To confirm, this is a known issue with upstream and not from KoboldCpp? If so, let me know when upstream has fixed it, and I will pull the fix.
@LostRuins llama.cpp now supports F16 Metal MPS acceleration, but koboldcpp does not. The others - q4_0, q4_1, and q2_k ~ q6_k - look okay. I think this means the problem is on koboldcpp's side.
That is strange, as I did not modify any of the Metal files - so the behavior should be identical to upstream. Unfortunately I cannot debug it myself as I have no compatible Mac device. If you figure out what to change, let me know and I can update it. I think generally nobody really uses the f16 format though.
@LostRuins What I have spotted is that the source file containing "Warning: Your model may be an OUTDATED format" does not exist in llama.cpp; I found that gpttype_adapter.cpp from koboldcpp has this string. So I guess I need to investigate gpttype_adapter.cpp soon.
I think I should close this discussion because Apple Metal inference is now well supported in koboldcpp. I found that I need to change expose.cpp and gpttype_adapter.cpp, and maybe the header files, but the code is very complex, and nowadays I am too busy to analyze and change it soon. I will make a new pull request for supporting Apple Metal inference of F16. Thanks for changing the code to support most Apple Metal inference. 👍
You're welcome. Thanks for helping to debug Metal, since I cannot run it myself.
Please enable Metal inference for GGML weight models.
llama.cpp can now generate text using Metal inference on Apple Silicon computers. It is very good news for M1/M2 users; I can run the LLaMA 65B GGML q4_0 model on my M1 Max at a speed of 4 ~ 5 tokens/s. It is awesome and really fantastic!
The koboldcpp repository already has the related source files from llama.cpp, like ggml-metal.h, ggml-metal.m, and ggml-metal.metal. So please make them available during inference for text generation. It would be a very special present for Apple Silicon users.