
perf: CPU utilization is not 100% #131

Closed

jon-chuang opened this issue Apr 12, 2023 · 10 comments

Comments

@jon-chuang
Contributor

This suggests some issue with the use of GGML, or that some of the recent optimizations have not been applied.

llama-rs
[screenshot: CPU utilization while running llama-rs]

llama.cpp
[screenshot: CPU utilization while running llama.cpp]

@philpax
Collaborator

philpax commented Apr 13, 2023

Are they being run with the same thread count?

@jon-chuang
Contributor Author

Yes, n=16.

@jon-chuang
Contributor Author

jon-chuang commented Apr 13, 2023

With LTO=true:
[screenshot: CPU utilization with LTO enabled]

It's not clear whether there is a difference, but CPU utilization is definitely still not 100%.

A real apples-to-apples comparison would be to measure tokens per second.

@jon-chuang
Contributor Author

jon-chuang commented Apr 13, 2023

This is a rough comparison and may depend on other factors such as how the timings are measured, but:

llama.cpp

llama_print_timings:        load time =   905.95 ms
llama_print_timings:      sample time =    81.11 ms /   128 runs   (    0.63 ms per run)
llama_print_timings: prompt eval time =   591.41 ms /     4 tokens (  147.85 ms per token)
llama_print_timings:        eval time = 61109.21 ms /   127 runs   (  481.17 ms per run)
llama_print_timings:       total time = 62098.81 ms

llama-rs

prompt_tokens: 5
predict_duration: 33739ms
predict_tokens: 133
per_token_duration: 253.677ms

It seems that llama-rs is actually doing better.

Comparing:

  • llama.cpp: 481.17 ms per run
  • llama-rs: per_token_duration: 253.677ms
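
For reference, here's a quick sketch (mine, not taken from either project) that converts the per-token latencies quoted above into tokens per second and a speed ratio:

// Minimal sketch: convert per-token latency (ms) into tokens/sec and a speed ratio.
// Uses the figures quoted above; not part of llama-rs or llama.cpp.
fn tokens_per_second(ms_per_token: f64) -> f64 {
    1000.0 / ms_per_token
}

fn main() {
    let llama_cpp_ms = 481.17;  // llama.cpp "eval time ... ms per run"
    let llama_rs_ms = 253.677;  // llama-rs "per_token_duration"

    println!("llama.cpp: {:.2} tok/s", tokens_per_second(llama_cpp_ms)); // ~2.08
    println!("llama-rs:  {:.2} tok/s", tokens_per_second(llama_rs_ms));  // ~3.94
    println!("ratio: {:.2}x", llama_cpp_ms / llama_rs_ms);               // ~1.90x
}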

@jon-chuang
Contributor Author

jon-chuang commented Apr 13, 2023

I got all of the params to agree exactly this time:

sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 256, n_keep = 0
n_threads=16

llama.cpp

> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings:        load time =  1276.21 ms
llama_print_timings:      sample time =   162.74 ms /   256 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =   953.10 ms /     4 tokens (  238.28 ms per token)
llama_print_timings:        eval time = 109762.66 ms /   255 runs   (  430.44 ms per run)
llama_print_timings:       total time = 111207.18 ms

llama-rs

> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 256 -p "### Human:"
[2023-04-13T02:30:34Z INFO  llama_cli] Infer stats:
    feed_prompt_duration: 844ms
    prompt_tokens: 5
    predict_duration: 71199ms
    predict_tokens: 261
    per_token_duration: 272.793ms

So it seems that llama-rs is about 111207/71199 ≈ 1.56× as fast as llama.cpp (roughly 56% faster).

@jon-chuang
Contributor Author

jon-chuang commented Apr 13, 2023

n_generated = 512:

Summary: roughly 2.2× as fast as llama.cpp (629.11 ms vs 280.42 ms per token)

llama.cpp

> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings:        load time =  1196.16 ms
llama_print_timings:      sample time =   333.42 ms /   512 runs   (    0.65 ms per run)
llama_print_timings: prompt eval time =   878.44 ms /     4 tokens (  219.61 ms per token)
llama_print_timings:        eval time = 321476.10 ms /   511 runs   (  629.11 ms per run)
llama_print_timings:       total time = 323017.46 ms

llama-rs

> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 512 -p "### Human:"
[2023-04-13T02:55:33Z INFO  llama_cli] Infer stats:
    feed_prompt_duration: 803ms
    prompt_tokens: 5
    predict_duration: 144975ms
    predict_tokens: 517
    per_token_duration: 280.416ms

@jon-chuang
Contributor Author

ctx window = 512, generated_tokens = 256:

llama.cpp

llama_print_timings:        load time =  1049.35 ms
llama_print_timings:      sample time =   163.11 ms /   256 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =   728.29 ms /     4 tokens (  182.07 ms per token)
llama_print_timings:        eval time = 97357.60 ms /   255 runs   (  381.79 ms per run)
llama_print_timings:       total time = 98575.83 ms

llama-rs

[2023-04-13T03:05:36Z INFO  llama_cli] Infer stats:
    feed_prompt_duration: 699ms
    prompt_tokens: 5
    predict_duration: 73010ms
    predict_tokens: 261
    per_token_duration: 279.732ms

@jon-chuang
Contributor Author

I was mistaken: llama-rs uses the number of physical (hardware) cores, so n_threads was actually 8, which turned out to be optimal since it seems we are compute-bound.
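
For context, here's a minimal sketch of deriving a default thread count from physical cores using the num_cpus crate (an illustration of the idea, not necessarily the exact code llama-rs uses):

// Illustration only: choosing a default thread count from physical cores.
// Uses the `num_cpus` crate; not necessarily how llama-rs derives its default.
fn default_thread_count() -> usize {
    // Physical cores (e.g. 8 on an 8C/16T machine) rather than logical cores (16)
    // tend to be the better default for compute-bound inference.
    num_cpus::get_physical()
}

fn main() {
    println!("logical cores:  {}", num_cpus::get());
    println!("physical cores: {}", num_cpus::get_physical());
    println!("default n_threads: {}", default_thread_count());
}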

@setzer22
Collaborator

Using all your logical cores is actually detrimental to performance when running CPU inference because of SIMD instructions. Hyperthreading only helps when the two threads sharing a core use different parts of it, but here all threads want to use exactly the same execution units, since they're all running the same number-crunching code.

So when you set a higher thread count you see all the CPU cores at 100%, but it's slower: that isn't actual work being done, it's just multiple threads fighting over the same piece of hardware inside each CPU core.
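
To illustrate, here's a small self-contained experiment (my own sketch, unrelated to ggml) that splits a fixed amount of floating-point work across N threads and reports wall time. On a machine whose FP units are already saturated at the physical-core count, going from 8 to 16 threads should show little or no improvement; exact results will vary by CPU and workload:

use std::thread;
use std::time::Instant;

// Each thread runs the same multiply-accumulate loop, so all of them compete
// for the same floating-point/SIMD units inside each physical core.
fn crunch(iters: u64) -> f64 {
    // Several independent accumulators keep the FP units busy per thread.
    let (mut a, mut b, mut c, mut d) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    let mut x = 1.000_000_1f64;
    for _ in 0..iters {
        a += x * x;
        b += x * 1.1;
        c += x * 0.9;
        d += x + 1.0;
        x += 1e-9;
    }
    a + b + c + d
}

fn main() {
    const TOTAL_ITERS: u64 = 1_000_000_000;
    for &n_threads in &[1usize, 4, 6, 8, 16] {
        let per_thread = TOTAL_ITERS / n_threads as u64;
        let start = Instant::now();
        let handles: Vec<_> = (0..n_threads)
            .map(|_| thread::spawn(move || crunch(per_thread)))
            .collect();
        // Sum the results so the compiler can't optimize the work away.
        let checksum: f64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!(
            "{:2} threads: {:?} (checksum {:.3})",
            n_threads,
            start.elapsed(),
            checksum
        );
    }
}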

@jon-chuang
Contributor Author

jon-chuang commented Apr 13, 2023

Strangely enough, I see that 6 threads (out of 16 logical) is more optimal than 8, and that 4 threads matches the performance of 8.

What I also see is that all cores are still active even when I set the number of threads to 4.

I believe that ggml's handling of threads may be extremely idiosyncratic. Investigating now.
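
One way to dig into this is a quick-and-dirty sweep harness along these lines. This is my own sketch: it assumes the CLI exposes a --num-threads flag and that the binary sits at target/release/llama-cli after a release build, so substitute whatever option and path llama-cli actually provides:

use std::process::Command;
use std::time::Instant;

// Rough sweep over thread counts for the llama-cli invocation used above.
// The `--num-threads` flag is an assumption; adjust to the real flag name.
fn main() {
    for n_threads in [1, 2, 4, 6, 8, 12, 16] {
        let threads_arg = n_threads.to_string();
        let start = Instant::now();
        let status = Command::new("./target/release/llama-cli")
            .args([
                "infer",
                "-m", "../models/vicuna-13b/ggml-model-q4_0.bin",
                "--num-ctx-tokens", "512",
                "-n", "64",
                "--num-threads", threads_arg.as_str(),
                "-p", "### Human:",
            ])
            .status()
            .expect("failed to launch llama-cli");
        println!(
            "n_threads={:2} -> {:?} (exit status: {})",
            n_threads,
            start.elapsed(),
            status
        );
    }
}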

Related investigations here:
