Multi-thread ggml_cpy() #824

Closed
wants to merge 3 commits

Conversation

@ironman5366 ironman5366 commented Apr 7, 2023

This is an attempt at #782.

Apologies if this is the wrong approach; I'm still learning the codebase!

Tested on a Mac (i7-1068NG7). The change in performance seems to be very small, if any:

Before:

========================================
perf_total_per_op_us[            NONE] =   0.000 ms
perf_total_per_op_us[             DUP] =   0.000 ms
perf_total_per_op_us[             ADD] =   0.346 ms
perf_total_per_op_us[             SUB] =   0.000 ms
perf_total_per_op_us[             MUL] =   0.492 ms
perf_total_per_op_us[             DIV] =   0.000 ms
perf_total_per_op_us[             SQR] =   0.000 ms
perf_total_per_op_us[            SQRT] =   0.000 ms
perf_total_per_op_us[             SUM] =   0.000 ms
perf_total_per_op_us[            MEAN] =   0.000 ms
perf_total_per_op_us[          REPEAT] =   0.000 ms
perf_total_per_op_us[             ABS] =   0.000 ms
perf_total_per_op_us[             SGN] =   0.000 ms
perf_total_per_op_us[             NEG] =   0.000 ms
perf_total_per_op_us[            STEP] =   0.000 ms
perf_total_per_op_us[            RELU] =   0.000 ms
perf_total_per_op_us[            GELU] =   0.000 ms
perf_total_per_op_us[            SILU] =   1.182 ms
perf_total_per_op_us[            NORM] =   0.000 ms
perf_total_per_op_us[        RMS_NORM] =   0.670 ms
perf_total_per_op_us[         MUL_MAT] = 249.842 ms
perf_total_per_op_us[           SCALE] =   0.097 ms
perf_total_per_op_us[             CPY] =   2.084 ms
perf_total_per_op_us[         RESHAPE] =   0.138 ms
perf_total_per_op_us[            VIEW] =   0.138 ms
perf_total_per_op_us[         PERMUTE] =   0.098 ms
perf_total_per_op_us[       TRANSPOSE] =   0.031 ms
perf_total_per_op_us[        GET_ROWS] =   0.003 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.037 ms
perf_total_per_op_us[        SOFT_MAX] =   0.336 ms
perf_total_per_op_us[            ROPE] =   2.216 ms
perf_total_per_op_us[      CONV_1D_1S] =   0.000 ms
perf_total_per_op_us[      CONV_1D_2S] =   0.000 ms
perf_total_per_op_us[      FLASH_ATTN] =   0.000 ms
perf_total_per_op_us[        FLASH_FF] =   0.000 ms
========================================
llama_print_timings:        load time =  2801.11 ms
llama_print_timings:      sample time =   162.00 ms /   128 runs   (    1.27 ms per run)
llama_print_timings: prompt eval time =  1553.28 ms /     7 tokens (  221.90 ms per token)
llama_print_timings:        eval time = 26238.64 ms /   127 runs   (  206.60 ms per run)
llama_print_timings:       total time = 29204.85 ms

After:

perf_total_per_op_us[            NONE] =   0.000 ms
perf_total_per_op_us[             DUP] =   0.000 ms
perf_total_per_op_us[             ADD] =   0.412 ms
perf_total_per_op_us[             SUB] =   0.000 ms
perf_total_per_op_us[             MUL] =   0.477 ms
perf_total_per_op_us[             DIV] =   0.000 ms
perf_total_per_op_us[             SQR] =   0.000 ms
perf_total_per_op_us[            SQRT] =   0.000 ms
perf_total_per_op_us[             SUM] =   0.000 ms
perf_total_per_op_us[            MEAN] =   0.000 ms
perf_total_per_op_us[          REPEAT] =   0.000 ms
perf_total_per_op_us[             ABS] =   0.000 ms
perf_total_per_op_us[             SGN] =   0.000 ms
perf_total_per_op_us[             NEG] =   0.000 ms
perf_total_per_op_us[            STEP] =   0.000 ms
perf_total_per_op_us[            RELU] =   0.000 ms
perf_total_per_op_us[            GELU] =   0.000 ms
perf_total_per_op_us[            SILU] =   1.307 ms
perf_total_per_op_us[            NORM] =   0.000 ms
perf_total_per_op_us[        RMS_NORM] =   0.658 ms
perf_total_per_op_us[         MUL_MAT] = 284.981 ms
perf_total_per_op_us[           SCALE] =   0.097 ms
perf_total_per_op_us[             CPY] =   1.749 ms
perf_total_per_op_us[         RESHAPE] =   0.142 ms
perf_total_per_op_us[            VIEW] =   0.149 ms
perf_total_per_op_us[         PERMUTE] =   0.108 ms
perf_total_per_op_us[       TRANSPOSE] =   0.034 ms
perf_total_per_op_us[        GET_ROWS] =   0.898 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.042 ms
perf_total_per_op_us[        SOFT_MAX] =   0.528 ms
perf_total_per_op_us[            ROPE] =   1.741 ms
perf_total_per_op_us[      CONV_1D_1S] =   0.000 ms
perf_total_per_op_us[      CONV_1D_2S] =   0.000 ms
perf_total_per_op_us[      FLASH_ATTN] =   0.000 ms
perf_total_per_op_us[        FLASH_FF] =   0.000 ms
llama_print_timings:        load time =  2540.09 ms
llama_print_timings:      sample time =   154.58 ms /   128 runs   (    1.21 ms per run)
llama_print_timings: prompt eval time =  1481.09 ms /     7 tokens (  211.58 ms per token)
llama_print_timings:        eval time = 24397.64 ms /   127 runs   (  192.11 ms per run)
llama_print_timings:       total time = 27095.16 ms

(The GGML_PERF outputs and the print-timings outputs are from separate runs; each is the second invocation after building, using the same command.)

Additionally, I'm having a bit of trouble testing the f16 codepath for this. I'm sure I'm doing something silly, but even when running with an f16 model, I never see src0->type equal GGML_TYPE_F16. Any tips for building?

Thanks!

@ironman5366 ironman5366 marked this pull request as ready for review April 7, 2023 05:36
@Fs77X

Fs77X commented Apr 7, 2023

Just wondering, do you get garbage output with your implementation? I've tried implementing it similarly to how ggml_rope()'s multithreading was implemented, and ran with this command and prompt:
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Becoming a software engineer is as easy as 5 basic steps:" -n 128

@rabidcopy
Contributor

rabidcopy commented Apr 7, 2023

Ran a comparison between master and this PR about 10 times on Linux with a Ryzen 2600.
./main -b 512 -c 2048 -n 100 -m ../llama-13b/ggml-model-q4_0.bin --seed 1
Before:

llama_print_timings:        load time =  7186.31 ms
llama_print_timings:      sample time =    65.66 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  6236.80 ms /     2 tokens ( 3118.40 ms per token)
llama_print_timings:        eval time = 30004.73 ms /    99 runs   (  303.08 ms per run)
llama_print_timings:       total time = 37258.44 ms

After:

llama_print_timings:        load time = 20631.08 ms
llama_print_timings:      sample time =    66.18 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time = 19632.83 ms /     2 tokens ( 9816.41 ms per token)
llama_print_timings:        eval time = 32962.46 ms /    99 runs   (  332.95 ms per run)
llama_print_timings:       total time = 53661.55 ms

Seems to be consistently slower in my case. Text output remained the same. On a 7B model the difference is less noticeable, and it's tough to say whether there's any difference at all.
Before:

llama_print_timings:        load time =  1039.46 ms
llama_print_timings:      sample time =    65.21 ms /   100 runs   (    0.65 ms per run)
llama_print_timings: prompt eval time =   291.28 ms /     2 tokens (  145.64 ms per token)
llama_print_timings:        eval time = 15080.40 ms /    99 runs   (  152.33 ms per run)
llama_print_timings:       total time = 16186.34 ms

After:

llama_print_timings:        load time =  1160.47 ms
llama_print_timings:      sample time =    65.84 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =   314.10 ms /     2 tokens (  157.05 ms per token)
llama_print_timings:        eval time = 15341.87 ms /    99 runs   (  154.97 ms per run)
llama_print_timings:       total time = 16569.47 ms

Testing more on 13B, still consistently getting times like 312ms on master versus 330ms/347ms/375ms on this PR.

@ironman5366
Author

Seems to be consistently slower in my case.

Well that's not what we want! Thanks for the additional context from 13B, I'd only been trying with 7B. I'll try some further optimizations and see if it can be salvaged

@rabidcopy
Contributor

Seems to be consistently slower in my case.

Well that's not what we want! Thanks for the additional context from 13B, I'd only been trying with 7B. I'll try some further optimizations and see if it can be salvaged

Yeah, it's strange. 7B seems to be more or less unaffected, but a noticeable hit occurs for me with 13B. Can't test any higher than that though, not enough RAM. I did part of my testing with --mlock to ensure the model wasn't going into swap and causing inconsistencies, and still got the same results.

@howard0su
Collaborator

How does the perf gain relate to the number of threads? I'm worried that memcpy is too cheap to scale to many cores. Maybe test with 2, 4, and 8 threads to see the difference.

@ggerganov
Member

Btw, it may be possible that multi-threading the memcpy calls does not give any significant benefit, so it might not be needed after all. But it's always good to investigate and make sure.

Maybe also try splitting the work by interleaving the threads.
I.e., instead of thread 0 processing rows [0, n/t) and thread 1 processing [n/t, 2n/t), etc., you can make thread 0 process rows [0, t, 2t, 3t, ...] and thread 1 process rows [1, t+1, 2t+1, ...], etc.
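
For illustration, here's a minimal standalone sketch of the two splitting strategies (nr, nth, and copy_row are illustrative placeholders, and the sequential outer loops stand in for the worker threads; this is not the actual ggml code):

#include <stdio.h>

enum { nr = 12, nth = 4 }; // example: 12 rows split across 4 threads

// stands in for copying one row of the tensor
static void copy_row(int ith, int ir) {
    printf("thread %d -> row %d\n", ith, ir);
}

int main(void) {
    // contiguous blocks: thread ith handles rows [ith*dr, min(ith*dr + dr, nr))
    for (int ith = 0; ith < nth; ++ith) {
        const int dr  = (nr + nth - 1)/nth;
        const int ir0 = dr*ith;
        const int ir1 = ir0 + dr < nr ? ir0 + dr : nr;
        for (int ir = ir0; ir < ir1; ++ir) {
            copy_row(ith, ir);
        }
    }

    // interleaved: thread 0 handles rows 0, nth, 2*nth, ...; thread 1 handles 1, nth+1, ...
    for (int ith = 0; ith < nth; ++ith) {
        for (int ir = ith; ir < nr; ir += nth) {
            copy_row(ith, ir);
        }
    }
    return 0;
}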

@rabidcopy
Contributor

Testing again on 13B.
Before:

llama_print_timings:        load time =  1589.07 ms
llama_print_timings:      sample time =    74.28 ms /   100 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   626.66 ms /     2 tokens (  313.33 ms per token)
llama_print_timings:        eval time = 26578.89 ms /    99 runs   (  268.47 ms per run)
llama_print_timings:       total time = 28243.52 ms

After:

llama_print_timings:        load time =  4835.10 ms
llama_print_timings:      sample time =    76.55 ms /   100 runs   (    0.77 ms per run)
llama_print_timings: prompt eval time =  3860.21 ms /     2 tokens ( 1930.11 ms per token)
llama_print_timings:        eval time = 26631.49 ms /    99 runs   (  269.00 ms per run)
llama_print_timings:       total time = 31544.44 ms

(I dunno why it keeps giving a higher prompt eval time.)

@ironman5366
Author

@rabidcopy you beat me to it! I was just about to post my benchmarking results. This might be a platform difference, or something I'm missing, but for me interleaving the threads seems to have done the trick. I wrote a little benchmark runner script (using mlock like you suggested) to do 10 runs each on 7B and 13B, and these are the outputs:

Before Multi-Threading:

Benchmarking 7B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 178.81 ms per run
Run 2/10 starting...
Run 2/: 183.2 ms per run
Run 3/10 starting...
Run 3/: 187.4 ms per run
Run 4/10 starting...
Run 4/: 177.27 ms per run
Run 5/10 starting...
Run 5/: 191.59 ms per run
Run 6/10 starting...
Run 6/: 181.69 ms per run
Run 7/10 starting...
Run 7/: 180.46 ms per run
Run 8/10 starting...
Run 8/: 184.72 ms per run
Run 9/10 starting...
Run 9/: 184.28 ms per run
Run 10/10 starting...
Run 10/: 185.46 ms per run
Min: 177.27 ms
Max: 191.59 ms
Avg: 183.488 ms
Benchmarking 13B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 483.07 ms per run
Run 2/10 starting...
Run 2/: 419.58 ms per run
Run 3/10 starting...
Run 3/: 394.25 ms per run
Run 4/10 starting...
Run 4/: 392.78 ms per run
Run 5/10 starting...
Run 5/: 385.48 ms per run
Run 6/10 starting...
Run 6/: 389.2 ms per run
Run 7/10 starting...
Run 7/: 390.36 ms per run
Run 8/10 starting...
Run 8/: 384.67 ms per run
Run 9/10 starting...
Run 9/: 372.03 ms per run
Run 10/10 starting...
Run 10/: 366.3 ms per run
Min: 366.3 ms
Max: 483.07 ms
Avg: 397.77200000000005 ms

After Multi-Threading:

Benchmarking 7B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 174.15 ms per run
Run 2/10 starting...
Run 2/: 180.09 ms per run
Run 3/10 starting...
Run 3/: 179.07 ms per run
Run 4/10 starting...
Run 4/: 182.59 ms per run
Run 5/10 starting...
Run 5/: 181.67 ms per run
Run 6/10 starting...
Run 6/: 181.78 ms per run
Run 7/10 starting...
Run 7/: 183.7 ms per run
Run 8/10 starting...
Run 8/: 183.88 ms per run
Run 9/10 starting...
Run 9/: 188.41 ms per run
Run 10/10 starting...
Run 10/: 186.34 ms per run
Min: 174.15 ms
Max: 188.41 ms
Avg: 182.16799999999998 ms
Benchmarking 13B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 339.03 ms per run
Run 2/10 starting...
Run 2/: 351.25 ms per run
Run 3/10 starting...
Run 3/: 352.01 ms per run
Run 4/10 starting...
Run 4/: 356.98 ms per run
Run 5/10 starting...
Run 5/: 353.85 ms per run
Run 6/10 starting...
Run 6/: 357.22 ms per run
Run 7/10 starting...
Run 7/: 361.47 ms per run
Run 8/10 starting...
Run 8/: 362.29 ms per run
Run 9/10 starting...
Run 9/: 364.97 ms per run
Run 10/10 starting...
Run 10/: 366.02 ms per run
Min: 339.03 ms
Max: 366.02 ms
Avg: 356.50900000000007 ms

Regardless, the perf difference seems very minimal.

I think there are maybe some minor optimizations that could still be made here, but it would definitely be a case of diminishing returns.

I haven't had a chance to test with different thread counts as Howard suggested yet; I'm planning to try that tomorrow.

@rabidcopy
Contributor

Tested just on 13B.
Before:

Run 1/10 starting...
Run 1/: 277.26 ms per run
Run 2/10 starting...
Run 2/: 271.91 ms per run
Run 3/10 starting...
Run 3/: 267.1 ms per run
Run 4/10 starting...
Run 4/: 334.97 ms per run
Run 5/10 starting...
Run 5/: 276.55 ms per run
Run 6/10 starting...
Run 6/: 269.92 ms per run
Run 7/10 starting...
Run 7/: 274.91 ms per run
Run 8/10 starting...
Run 8/: 271.8 ms per run
Run 9/10 starting...
Run 9/: 274.23 ms per run
Run 10/10 starting...
Run 10/: 274.71 ms per run
Min: 267.1 ms
Max: 334.97 ms
Avg: 279.33600000000007 ms

After:

Run 1/10 starting...
Run 1/: 280.65 ms per run
Run 2/10 starting...
Run 2/: 274.67 ms per run
Run 3/10 starting...
Run 3/: 294.34 ms per run
Run 4/10 starting...
Run 4/: 286.44 ms per run
Run 5/10 starting...
Run 5/: 269.75 ms per run
Run 6/10 starting...
Run 6/: 276.93 ms per run
Run 7/10 starting...
Run 7/: 278.48 ms per run
Run 8/10 starting...
Run 8/: 276.63 ms per run
Run 9/10 starting...
Run 9/: 272.17 ms per run
Run 10/10 starting...
Run 10/: 290.9 ms per run
Min: 269.75 ms
Max: 294.34 ms
Avg: 280.096 ms

Still kind of hard to say, but at the very least there are no negative consequences now. So if it helps on other systems in some cases, that's good, I think.

Contributor

@Royalphax Royalphax left a comment

If I understand correctly, ggml_compute_forward_cpy is equivalent to ggml_compute_forward_dup; thus, multithreading the CPY operation means that the DUP operation is multithreaded too.

In that case, shouldn't line 9384 in ggml_graph_compute be modified this way?
From:

case GGML_OP_DUP:
{
    node->n_tasks = 1;
} break;

To:

case GGML_OP_DUP:
{
    node->n_tasks = n_threads;
} break;


*(float *) dst_ptr = GGML_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);
// Interleave execution so that in a 4 thread run thread 0 copies regions 0,4,8, ...
if ((region_index++ % total_threads) == thread_num) {
Contributor

@Fabio3rs Fabio3rs Apr 10, 2023

Is it possible to change how the loop counts, to achieve the same result without this branch?

Having the threads read/write contiguous memory regions can theoretically give better performance, because of how branch prediction and the memory-loading mechanisms (cache line size, for example) work.
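
As a minimal sketch of that suggestion (total_regions is a hypothetical name; thread_num and total_threads are the variables from the diff above), each thread would compute one contiguous [start, end) range up front instead of testing the modulo condition on every region:

// sketch: contiguous per-thread range, no per-region branch
static void copy_regions_contiguous(int thread_num, int total_threads, int total_regions) {
    const int per_thread = (total_regions + total_threads - 1)/total_threads;
    const int start = thread_num*per_thread;
    const int end   = start + per_thread < total_regions ? start + per_thread : total_regions;

    for (int i = start; i < end; ++i) {
        // copy region i here (e.g. one memcpy per region);
        // consecutive iterations touch adjacent memory
    }
}

Whether that actually helps here would still need benchmarking, since the copies may be memory-bandwidth bound either way.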

@ironman5366
Author

Closing this with the update in #782
