
perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower #932

Closed · jon-chuang opened this issue Apr 13, 2023 · 8 comments · Fixed by #934
jon-chuang (Contributor) commented Apr 13, 2023

Preliminary results show that llama.cpp is 1.5x-2x slower than llama-rs. Both were confirmed to compile with the same architecture flags and the same GNU toolchain.

Summary (on Vicuna 13B, 2048 ctx size, 256 predict tokens):

  • llama.cpp: 430.44 ms per run
  • llama-rs: per_token_duration: 272.793ms

Detailed results

An interesting observation is that CPU utilization is lower with llama-rs.

System Info:

llama.cpp

> make
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama-rs

warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3

No BLAS.

Note: the llama-rs benchmark was run on my branch.

jon-chuang changed the title from "perf: Investigate performance discrepancy with llama-rs" to "perf: Investigate performance discrepancy with llama-rs - 1.5x slower", then to "perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower" on Apr 13, 2023
ggerganov (Member):
I think I made a regression in c3ac702.

Can you check whether reverting it solves the issue?

jon-chuang (Contributor, Author):
I checked out 9d634ef. There was no improvement; in fact, it was a regression.

jon-chuang (Contributor, Author) commented Apr 13, 2023

You can see visually how much slower llama.cpp is:
(attached side-by-side comparison: llama-rs vs. llama.cpp)

Compiling with OpenBLAS makes things even worse: 560.66 ms per run.

jon-chuang (Contributor, Author) commented Apr 13, 2023

Oops, I found the issue: hyperthreading (#34 (comment)).

Using 8 threads instead of 16 (-t 8 -n 128), the per-run eval time drops to ~271 ms, in line with llama-rs:

llama_print_timings:        load time =  1421.72 ms
llama_print_timings:      sample time =    74.37 ms /   128 runs   (    0.58 ms per run)
llama_print_timings: prompt eval time =   720.68 ms /     4 tokens (  180.17 ms per token)
llama_print_timings:        eval time = 34420.26 ms /   127 runs   (  271.03 ms per run)
llama_print_timings:       total time = 35918.93 ms

jon-chuang (Contributor, Author) commented Apr 13, 2023

I think we should make the default the number of physical CPU cores. @ggerganov

MillionthOdin16:
Pretty sure the default in the code uses something like 4, or at least the initial examples do.

jon-chuang (Contributor, Author):
> Pretty sure the default in the code uses something like 4, or at least the initial examples do.

On Linux, the default is the number of logical threads:
https://github.com/ggerganov/llama.cpp/blob/e7f6997f897a18b6372a6460e25c5f89e1469f1d/examples/common.cpp#L35
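For illustration, here is a minimal sketch (my own assumption, not the code at that line and not necessarily how a fix would look) of deriving the physical-core count on Linux by counting unique (physical id, core id) pairs in /proc/cpuinfo, falling back to std::thread::hardware_concurrency(), which reports logical threads:

```cpp
#include <cstdio>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <utility>

// Sketch: count unique (physical id, core id) pairs from /proc/cpuinfo.
// With hyperthreading enabled, hardware_concurrency() returns twice this value.
static int num_physical_cores() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::set<std::pair<int, int>> cores; // unique (physical id, core id) pairs
    int physical_id = -1;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("physical id", 0) == 0) {
            physical_id = std::stoi(line.substr(line.find(':') + 1));
        } else if (line.rfind("core id", 0) == 0) {
            cores.insert({physical_id, std::stoi(line.substr(line.find(':') + 1))});
        }
    }
    if (!cores.empty()) {
        return (int) cores.size();
    }
    // Fallback: logical thread count, i.e. what the current default reports.
    return (int) std::thread::hardware_concurrency();
}

int main() {
    std::printf("physical cores: %d, logical threads: %u\n",
                num_physical_cores(), std::thread::hardware_concurrency());
    return 0;
}
```

On a machine like the one above (16 logical threads, presumably 8 physical cores), this would report 8 versus 16, matching the -t 8 setting that recovered the lost performance.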

MillionthOdin16 commented Apr 13, 2023 via email.
