
perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower #932

Closed · jon-chuang opened this issue Apr 13, 2023 · 8 comments · Fixed by #934
jon-chuang (Contributor) commented Apr 13, 2023

Preliminary results show that llama.cpp is 1.5x-2x slower than llama-rs. Both were confirmed to compile with the same architecture flags and the same GNU toolchain.

Summary (on Vicuna 13B, 2048 ctx size, 256 predict tokens):

  • llama.cpp: 430.44 ms per run
  • llama-rs: per_token_duration: 272.793ms

Detailed results

An interesting observation is that CPU utilization is lower with llama-rs.

System Info:

llama.cpp

> make
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama-rs

warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3

No BLAS.

Note: the llama-rs benchmark was run on my branch.

jon-chuang changed the title from "perf: Investigate performance discrepancy with llama-rs" to "perf: Investigate performance discrepancy with llama-rs - 1.5x slower", then to "perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower" on Apr 13, 2023
ggerganov (Member):
I think I made a regression in c3ac702.

Can you check whether reverting it solves the issue?

jon-chuang (Contributor, Author):
I checked out 9d634ef. There was no improvement; in fact, it was a regression.

jon-chuang (Contributor, Author) commented Apr 13, 2023

You can see visually how much slower llama.cpp is:
(attached side-by-side comparison: llama-rs vs. llama.cpp)

Compiling with OpenBLAS makes things even worse: 560.66 ms per run.

jon-chuang (Contributor, Author) commented Apr 13, 2023

Oops, I found the issue: hyperthreading (#34 (comment)).

Using 8 threads instead of 16 (-t 8 -n 128), the per-run eval time drops to ~271 ms, in line with llama-rs:

llama_print_timings:        load time =  1421.72 ms
llama_print_timings:      sample time =    74.37 ms /   128 runs   (    0.58 ms per run)
llama_print_timings: prompt eval time =   720.68 ms /     4 tokens (  180.17 ms per token)
llama_print_timings:        eval time = 34420.26 ms /   127 runs   (  271.03 ms per run)
llama_print_timings:       total time = 35918.93 ms

jon-chuang (Contributor, Author) commented Apr 13, 2023

I think we should make the default the number of physical CPU cores. @ggerganov

MillionthOdin16:
Pretty sure the default in the code uses something like 4, or at least the initial examples do.

jon-chuang (Contributor, Author):
> Pretty sure the default in the code uses something like 4, or at least the initial examples do.

On Linux, the default is the number of logical threads:
https://github.com/ggerganov/llama.cpp/blob/e7f6997f897a18b6372a6460e25c5f89e1469f1d/examples/common.cpp#L35
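For illustration, here is a minimal sketch (my own assumption, not the code at that line and not necessarily how a fix would look) of deriving the physical-core count on Linux by counting unique (physical id, core id) pairs in /proc/cpuinfo, falling back to std::thread::hardware_concurrency(), which reports logical threads:

```cpp
#include <cstdio>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <utility>

// Sketch: count unique (physical id, core id) pairs from /proc/cpuinfo.
// With hyperthreading enabled, hardware_concurrency() returns twice this value.
static int num_physical_cores() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::pair<int, int>> cores; // unique (physical id, core id) pairs
    int physical_id = -1;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("physical id", 0) == 0) {
            physical_id = std::stoi(line.substr(line.find(':') + 1));
        } else if (line.rfind("core id", 0) == 0) {
            cores.insert({physical_id, std::stoi(line.substr(line.find(':') + 1))});
        }
    }
    if (!cores.empty()) {
        return (int) cores.size();
    }
    // Fallback: logical thread count, i.e. what the current default reports.
    return (int) std::thread::hardware_concurrency();
}

int main() {
    std::printf("physical cores: %d, logical threads: %u\n",
                num_physical_cores(), std::thread::hardware_concurrency());
    return 0;
}
```

On a machine like the one above (16 logical threads, presumably 8 physical cores), this would report 8 versus 16, matching the -t 8 setting that recovered the lost performance.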

MillionthOdin16 commented Apr 13, 2023 via email.
