perf: CPU utilization is not 100% #131
Are they being run with the same thread count?

Yes, n=16.
A rough comparison (which may depend on other factors, like how the measurements were taken), but it seems that llama-rs is actually doing better. Comparing:

llama.cpp: (output screenshot)
llama-rs: (output screenshot)
I got all of the params to exactly agree this time:
llama.cpp:
> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings: load time = 1276.21 ms
llama_print_timings: sample time = 162.74 ms / 256 runs (0.64 ms per run)
llama_print_timings: prompt eval time = 953.10 ms / 4 tokens (238.28 ms per token)
llama_print_timings: eval time = 109762.66 ms / 255 runs (430.44 ms per run)
llama_print_timings: total time = 111207.18 ms

llama-rs:
> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 256 -p "### Human:"
[2023-04-13T02:30:34Z INFO llama_cli] Infer stats:
feed_prompt_duration: 844ms
prompt_tokens: 5
predict_duration: 71199ms
predict_tokens: 261
per_token_duration: 272.793ms

So it seems that llama-rs is about 2x faster than llama.cpp. With n = 512:

llama.cpp:
> ./main -m ../models/vicuna-13b/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.3 --color -p "### Human:" -c 2048
llama_print_timings: load time = 1196.16 ms
llama_print_timings: sample time = 333.42 ms / 512 runs (0.65 ms per run)
llama_print_timings: prompt eval time = 878.44 ms / 4 tokens (219.61 ms per token)
llama_print_timings: eval time = 321476.10 ms / 511 runs (629.11 ms per run)
llama_print_timings: total time = 323017.46 ms

llama-rs:
> cargo run --features "mmap" --release --bin llama-cli infer -m ../models/vicuna-13b/ggml-model-q4_0.bin --num-ctx-tokens 2048 -n 512 -p "### Human:"
[2023-04-13T02:55:33Z INFO llama_cli] Infer stats:
feed_prompt_duration: 803ms
prompt_tokens: 5
predict_duration: 144975ms
predict_tokens: 517
per_token_duration: 280.416ms
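As a rough sanity check on the "about 2x" figure, dividing llama.cpp's eval time per token by llama-rs's per_token_duration for the two runs above gives:

$$\frac{430.44}{272.793} \approx 1.6 \qquad\text{and}\qquad \frac{629.11}{280.416} \approx 2.2$$

i.e. roughly 1.6x on the 256-token run and 2.2x on the 512-token run.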
ctx window = 512, generated_tokens = 256:

llama.cpp:
llama_print_timings: load time = 1049.35 ms
llama_print_timings: sample time = 163.11 ms / 256 runs (0.64 ms per run)
llama_print_timings: prompt eval time = 728.29 ms / 4 tokens (182.07 ms per token)
llama_print_timings: eval time = 97357.60 ms / 255 runs (381.79 ms per run)
llama_print_timings: total time = 98575.83 ms

llama-rs:
[2023-04-13T03:05:36Z INFO llama_cli] Infer stats:
feed_prompt_duration: 699ms
prompt_tokens: 5
predict_duration: 73010ms
predict_tokens: 261
per_token_duration: 279.732ms
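The same per-token ratio for this run:

$$\frac{381.79}{279.732} \approx 1.4$$

a smaller gap than with the 2048-token context above.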
I was mistaken: llama-rs uses hardware (physical) cores, so n_threads was 8, which was optimal since it seems we are compute-bound.
Using all your logical cores is actually detrimental to performance when running CPU inference because of SIMD instructions. Hyperthreading helps when the two threads sharing a core use different parts of it, but here all threads want to use the exact same execution units, since they are running the same number-crunching code. So when you set a higher thread count you see all the CPU cores at 100%, but it's slower: that is not actual work being done, it's just multiple threads fighting over the same piece of hardware inside the CPU cores.
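For reference, a minimal Rust sketch of picking a thread count from physical rather than logical cores, along the lines of what is described above. It assumes the `num_cpus` crate is added as a dependency and is illustrative only, not llama-rs's actual defaulting code:

```rust
// Sketch: default the thread count to physical cores for compute-bound SIMD work.
// Assumes `num_cpus` in Cargo.toml; not the actual llama-rs logic.
fn default_thread_count() -> usize {
    let logical = num_cpus::get();           // e.g. 16 with hyperthreading enabled
    let physical = num_cpus::get_physical(); // e.g. 8 physical cores
    println!("logical: {logical}, physical: {physical}");
    // Sibling hyperthreads contend for the same vector units, so oversubscribing
    // past the physical core count tends to slow compute-bound inference down.
    physical.max(1)
}

fn main() {
    println!("using {} threads", default_thread_count());
}
```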
Strangely enough, I see that 6 of my 16 logical threads is more optimal than 8, and that 4 threads matches the performance of 8. What I also see is that all cores are still active even when I set the number of threads to 4. I believe ggml's handling of threads may be extremely idiosyncratic; investigating now. Related investigations here:
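One way to see where a particular machine stops scaling is a toy microbenchmark that is not llama-rs at all, just std Rust: every thread runs the same floating-point loop (standing in for ggml's vectorized kernels), and you sweep the thread count and compare throughput:

```rust
use std::time::Instant;

// Spawn `n` threads that all run the same FP-heavy loop and time the whole batch.
// If iterations/second stops improving past the physical core count, the work is
// limited by per-core execution units rather than by the number of logical CPUs.
fn run_with_threads(n: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| {
            std::thread::spawn(|| {
                let mut acc = 0.0f64;
                for i in 0..50_000_000u64 {
                    acc += (i as f64) * 1.000_000_1;
                }
                std::hint::black_box(acc) // keep the loop from being optimized away
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed().as_secs_f64()
}

fn main() {
    for n in [1, 2, 4, 6, 8, 12, 16] {
        let secs = run_with_threads(n);
        // Total work grows with n, so compare throughput, not wall time.
        println!("{n:2} threads: {secs:.2}s, {:.0} M iters/s", (n * 50) as f64 / secs);
    }
}
```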
This suggests some issue with how GGML is being used, or that some of the recent optimizations have not been applied:

llama-rs: (screenshot)
llama.cpp: (screenshot)