perf: CPU utilization is not 100% #131
Are they being run with the same thread count?

Yes, n=16.
A rough comparison (which may depend on other factors, like how the measurements were taken), but it seems that llama-rs is actually doing better. Comparing:

llama.cpp: (output screenshot)
llama-rs: (output screenshot)
I got all of the params to exactly agree this time:
llama.cpp:
> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings: load time = 1276.21 ms
llama_print_timings: sample time = 162.74 ms / 256 runs (0.64 ms per run)
llama_print_timings: prompt eval time = 953.10 ms / 4 tokens (238.28 ms per token)
llama_print_timings: eval time = 109762.66 ms / 255 runs (430.44 ms per run)
llama_print_timings: total time = 111207.18 ms

llama-rs:
> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 256 -p "### Human:"
[2023-04-13T02:30:34Z INFO llama_cli] Infer stats:
feed_prompt_duration: 844ms
prompt_tokens: 5
predict_duration: 71199ms
predict_tokens: 261
per_token_duration: 272.793ms

So it seems that llama-rs is about 2x faster than llama.cpp. With n = 512:

llama.cpp:
> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings: load time = 1196.16 ms
llama_print_timings: sample time = 333.42 ms / 512 runs (0.65 ms per run)
llama_print_timings: prompt eval time = 878.44 ms / 4 tokens (219.61 ms per token)
llama_print_timings: eval time = 321476.10 ms / 511 runs (629.11 ms per run)
llama_print_timings: total time = 323017.46 ms

llama-rs:
> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 512 -p "### Human:"
[2023-04-13T02:55:33Z INFO llama_cli] Infer stats:
feed_prompt_duration: 803ms
prompt_tokens: 5
predict_duration: 144975ms
predict_tokens: 517
per_token_duration: 280.416ms
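As a rough sanity check on the "about 2x" figure, dividing llama.cpp's eval time per token by llama-rs's per_token_duration for the two runs above gives:

$$\frac{430.44}{272.793} \approx 1.6 \qquad\text{and}\qquad \frac{629.11}{280.416} \approx 2.2$$

i.e. roughly 1.6x on the 256-token run and 2.2x on the 512-token run.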
ctx window = 512, generated_tokens = 256:

llama.cpp:
llama_print_timings: load time = 1049.35 ms
llama_print_timings: sample time = 163.11 ms / 256 runs (0.64 ms per run)
llama_print_timings: prompt eval time = 728.29 ms / 4 tokens (182.07 ms per token)
llama_print_timings: eval time = 97357.60 ms / 255 runs (381.79 ms per run)
llama_print_timings: total time = 98575.83 ms

llama-rs:
[2023-04-13T03:05:36Z INFO llama_cli] Infer stats:
feed_prompt_duration: 699ms
prompt_tokens: 5
predict_duration: 73010ms
predict_tokens: 261
per_token_duration: 279.732ms
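The same per-token ratio for this run:

$$\frac{381.79}{279.732} \approx 1.4$$

a smaller gap than with the 2048-token context above.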
I was mistaken: llama-rs uses hardware (physical) cores, so n_threads was 8, which was optimal since it seems we are compute-bound.
Using all your logical cores is actually detrimental to performance when running CPU inference because of SIMD instructions. Hyperthreading helps when the two threads sharing a core use different parts of it, but here all threads want to use the exact same execution units, since they are running the same number-crunching code. So when you set a higher thread count you see all the CPU cores at 100%, but it's slower: that is not actual work being done, it's just multiple threads fighting over the same piece of hardware inside the CPU cores.
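For reference, a minimal Rust sketch of picking a thread count from physical rather than logical cores, along the lines of what is described above. It assumes the `num_cpus` crate is added as a dependency and is illustrative only, not llama-rs's actual defaulting code:

```rust
// Sketch: default the thread count to physical cores for compute-bound SIMD work.
// Assumes `num_cpus` in Cargo.toml; not the actual llama-rs logic.
fn default_thread_count() -> usize {
    let logical = num_cpus::get();           // e.g. 16 with hyperthreading enabled
    let physical = num_cpus::get_physical(); // e.g. 8 physical cores
    println!("logical: {logical}, physical: {physical}");
    // Sibling hyperthreads contend for the same vector units, so oversubscribing
    // past the physical core count tends to slow compute-bound inference down.
    physical.max(1)
}

fn main() {
    println!("using {} threads", default_thread_count());
}
```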
Strangely enough, I see that 6 of my 16 logical threads is more optimal than 8, and that 4 threads matches the performance of 8. What I also see is that all cores are still active even when I set the number of threads to 4. I believe ggml's handling of threads may be extremely idiosyncratic; investigating now. Related investigations here:
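One way to see where a particular machine stops scaling is a toy microbenchmark that is not llama-rs at all, just std Rust: every thread runs the same floating-point loop (standing in for ggml's vectorized kernels), and you sweep the thread count and compare throughput:

```rust
use std::time::Instant;

// Spawn `n` threads that all run the same FP-heavy loop and time the whole batch.
// If iterations/second stops improving past the physical core count, the work is
// limited by per-core execution units rather than by the number of logical CPUs.
fn run_with_threads(n: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| {
            std::thread::spawn(|| {
                let mut acc = 0.0f64;
                for i in 0..50_000_000u64 {
                    acc += (i as f64) * 1.000_000_1;
                }
                std::hint::black_box(acc) // keep the loop from being optimized away
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed().as_secs_f64()
}

fn main() {
    for n in [1, 2, 4, 6, 8, 12, 16] {
        let secs = run_with_threads(n);
        // Total work grows with n, so compare throughput, not wall time.
        println!("{n:2} threads: {secs:.2}s, {:.0} M iters/s", (n * 50) as f64 / secs);
    }
}
```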
This suggests some issue with how GGML is being used, or that some of the recent optimizations have not been applied:

llama-rs: (screenshot)
llama.cpp: (screenshot)