13b model on M1 with 8gb RAM very slow #1500
Unanswered · joseph6377 asked this question in Q&A
Replies: 2 comments 2 replies
-
You see, the 13B model is larger than your physical RAM, so it gets paged through virtual memory, which is why it's so slow. You should use a 7B model for higher speed.
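As a rough sanity check (the figures below are taken straight from the llama.cpp log further down in this thread; the 8 GB figure is the machine's advertised RAM, so treat this as an estimate, not an exact accounting):

    mem required (reported by llama.cpp):  9807.48 MB (+ 1608.00 MB per state)
    physical RAM on an 8 GB M1:            8 * 1024 = 8192 MB
    shortfall:                             roughly 1.6 GB, before the per-state buffers are counted

You can confirm the installed RAM from a terminal with the standard macOS command sysctl hw.memsize, which prints the total physical memory in bytes.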
-
Two things:
-
Hello, I am a bit of a noob here.
I'm running 4-bit quantized models on an M1 with 8 GB of RAM. When I run the 13B model it is very slow. I have already tried setting mlock to true. Are there any other parameters I need to tweak?
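(For context: with the stock llama.cpp command-line binary, mlock is enabled via the --mlock flag. The exact command used here isn't shown, so the invocation below is only an illustrative guess, with the prompt left as a placeholder.)

    ./main -m /Users/jo/Documents/llama.cpp/models/wizard-mega-13B.ggml.q4_0.bin --mlock -n 128 -p "..."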
llama.cpp: loading model from /Users/jo/Documents/llama.cpp/models/wizard-mega-13B.ggml.q4_0.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 9807.48 MB (+ 1608.00 MB per state)
..............................................................warning: failed to mlock 44236800-byte buffer (after previously locking 5073518592 bytes): Resource temporarily unavailable
......................................
llama_init_from_file: kv self size = 400.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
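(Reading the mlock warning above: the process managed to lock 5,073,518,592 bytes, about 4.7 GiB, before the kernel refused the next ~42 MiB buffer with "Resource temporarily unavailable". That is consistent with the reported ~9.8 GB footprint simply not fitting into 8 GB of physical memory; mlock can only pin memory the machine actually has, so the flag does not help here.)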