Multi-thread ggml_cpy() #824

Closed
wants to merge 3 commits

Conversation

@ironman5366 ironman5366 commented Apr 7, 2023

This is an attempt at #782.

Apologies if this is the wrong approach; I'm still learning the codebase!

Tested on a Mac (i7-1068NG7). The change in performance seems to be very small, if any:

Before:

========================================
perf_total_per_op_us[            NONE] =   0.000 ms
perf_total_per_op_us[             DUP] =   0.000 ms
perf_total_per_op_us[             ADD] =   0.346 ms
perf_total_per_op_us[             SUB] =   0.000 ms
perf_total_per_op_us[             MUL] =   0.492 ms
perf_total_per_op_us[             DIV] =   0.000 ms
perf_total_per_op_us[             SQR] =   0.000 ms
perf_total_per_op_us[            SQRT] =   0.000 ms
perf_total_per_op_us[             SUM] =   0.000 ms
perf_total_per_op_us[            MEAN] =   0.000 ms
perf_total_per_op_us[          REPEAT] =   0.000 ms
perf_total_per_op_us[             ABS] =   0.000 ms
perf_total_per_op_us[             SGN] =   0.000 ms
perf_total_per_op_us[             NEG] =   0.000 ms
perf_total_per_op_us[            STEP] =   0.000 ms
perf_total_per_op_us[            RELU] =   0.000 ms
perf_total_per_op_us[            GELU] =   0.000 ms
perf_total_per_op_us[            SILU] =   1.182 ms
perf_total_per_op_us[            NORM] =   0.000 ms
perf_total_per_op_us[        RMS_NORM] =   0.670 ms
perf_total_per_op_us[         MUL_MAT] = 249.842 ms
perf_total_per_op_us[           SCALE] =   0.097 ms
perf_total_per_op_us[             CPY] =   2.084 ms
perf_total_per_op_us[         RESHAPE] =   0.138 ms
perf_total_per_op_us[            VIEW] =   0.138 ms
perf_total_per_op_us[         PERMUTE] =   0.098 ms
perf_total_per_op_us[       TRANSPOSE] =   0.031 ms
perf_total_per_op_us[        GET_ROWS] =   0.003 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.037 ms
perf_total_per_op_us[        SOFT_MAX] =   0.336 ms
perf_total_per_op_us[            ROPE] =   2.216 ms
perf_total_per_op_us[      CONV_1D_1S] =   0.000 ms
perf_total_per_op_us[      CONV_1D_2S] =   0.000 ms
perf_total_per_op_us[      FLASH_ATTN] =   0.000 ms
perf_total_per_op_us[        FLASH_FF] =   0.000 ms
========================================
llama_print_timings:        load time =  2801.11 ms
llama_print_timings:      sample time =   162.00 ms /   128 runs   (    1.27 ms per run)
llama_print_timings: prompt eval time =  1553.28 ms /     7 tokens (  221.90 ms per token)
llama_print_timings:        eval time = 26238.64 ms /   127 runs   (  206.60 ms per run)
llama_print_timings:       total time = 29204.85 ms

After:

perf_total_per_op_us[            NONE] =   0.000 ms
perf_total_per_op_us[             DUP] =   0.000 ms
perf_total_per_op_us[             ADD] =   0.412 ms
perf_total_per_op_us[             SUB] =   0.000 ms
perf_total_per_op_us[             MUL] =   0.477 ms
perf_total_per_op_us[             DIV] =   0.000 ms
perf_total_per_op_us[             SQR] =   0.000 ms
perf_total_per_op_us[            SQRT] =   0.000 ms
perf_total_per_op_us[             SUM] =   0.000 ms
perf_total_per_op_us[            MEAN] =   0.000 ms
perf_total_per_op_us[          REPEAT] =   0.000 ms
perf_total_per_op_us[             ABS] =   0.000 ms
perf_total_per_op_us[             SGN] =   0.000 ms
perf_total_per_op_us[             NEG] =   0.000 ms
perf_total_per_op_us[            STEP] =   0.000 ms
perf_total_per_op_us[            RELU] =   0.000 ms
perf_total_per_op_us[            GELU] =   0.000 ms
perf_total_per_op_us[            SILU] =   1.307 ms
perf_total_per_op_us[            NORM] =   0.000 ms
perf_total_per_op_us[        RMS_NORM] =   0.658 ms
perf_total_per_op_us[         MUL_MAT] = 284.981 ms
perf_total_per_op_us[           SCALE] =   0.097 ms
perf_total_per_op_us[             CPY] =   1.749 ms
perf_total_per_op_us[         RESHAPE] =   0.142 ms
perf_total_per_op_us[            VIEW] =   0.149 ms
perf_total_per_op_us[         PERMUTE] =   0.108 ms
perf_total_per_op_us[       TRANSPOSE] =   0.034 ms
perf_total_per_op_us[        GET_ROWS] =   0.898 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.042 ms
perf_total_per_op_us[        SOFT_MAX] =   0.528 ms
perf_total_per_op_us[            ROPE] =   1.741 ms
perf_total_per_op_us[      CONV_1D_1S] =   0.000 ms
perf_total_per_op_us[      CONV_1D_2S] =   0.000 ms
perf_total_per_op_us[      FLASH_ATTN] =   0.000 ms
perf_total_per_op_us[        FLASH_FF] =   0.000 ms
llama_print_timings:        load time =  2540.09 ms
llama_print_timings:      sample time =   154.58 ms /   128 runs   (    1.21 ms per run)
llama_print_timings: prompt eval time =  1481.09 ms /     7 tokens (  211.58 ms per token)
llama_print_timings:        eval time = 24397.64 ms /   127 runs   (  192.11 ms per run)
llama_print_timings:       total time = 27095.16 ms

(The GGML_PERF outputs and the print-timings outputs are from separate runs; each is the second invocation after building, using the same command.)

Additionally, I'm having a bit of trouble testing the f16 codepath for this. I'm sure I'm doing something silly, but even when running with an f16 model, I never see src0->type equal GGML_TYPE_F16. Any tips for building?

Thanks!

@ironman5366 ironman5366 marked this pull request as ready for review April 7, 2023 05:36
@Fs77X

Fs77X commented Apr 7, 2023

Just wondering, do you get garbage output with your implementation? I've tried implementing it similarly to how ggml_rope()'s multithreading was implemented, and ran with this command and prompt:
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Becoming a software engineer is as easy as 5 basic steps:" -n 128

@rabidcopy
Contributor

rabidcopy commented Apr 7, 2023

Ran a comparison between master and this PR about 10 times on Linux with a Ryzen 2600.
./main -b 512 -c 2048 -n 100 -m ../llama-13b/ggml-model-q4_0.bin --seed 1
Before:

llama_print_timings:        load time =  7186.31 ms
llama_print_timings:      sample time =    65.66 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  6236.80 ms /     2 tokens ( 3118.40 ms per token)
llama_print_timings:        eval time = 30004.73 ms /    99 runs   (  303.08 ms per run)
llama_print_timings:       total time = 37258.44 ms

After:

llama_print_timings:        load time = 20631.08 ms
llama_print_timings:      sample time =    66.18 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time = 19632.83 ms /     2 tokens ( 9816.41 ms per token)
llama_print_timings:        eval time = 32962.46 ms /    99 runs   (  332.95 ms per run)
llama_print_timings:       total time = 53661.55 ms

Seems to be consistently slower in my case. Text output remained the same. On a 7B model the difference is less noticeable, and it's tough to say whether there's any difference at all.
Before:

llama_print_timings:        load time =  1039.46 ms
llama_print_timings:      sample time =    65.21 ms /   100 runs   (    0.65 ms per run)
llama_print_timings: prompt eval time =   291.28 ms /     2 tokens (  145.64 ms per token)
llama_print_timings:        eval time = 15080.40 ms /    99 runs   (  152.33 ms per run)
llama_print_timings:       total time = 16186.34 ms

After:

llama_print_timings:        load time =  1160.47 ms
llama_print_timings:      sample time =    65.84 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =   314.10 ms /     2 tokens (  157.05 ms per token)
llama_print_timings:        eval time = 15341.87 ms /    99 runs   (  154.97 ms per run)
llama_print_timings:       total time = 16569.47 ms

Testing more on 13B, still consistently getting times like 312ms on master versus 330ms/347ms/375ms on this PR.

@ironman5366
Author

Seems to be consistently slower in my case.

Well that's not what we want! Thanks for the additional context from 13B, I'd only been trying with 7B. I'll try some further optimizations and see if it can be salvaged

@rabidcopy
Contributor

Seems to be consistently slower in my case.

Well that's not what we want! Thanks for the additional context from 13B, I'd only been trying with 7B. I'll try some further optimizations and see if it can be salvaged

Yeah, it's strange. 7B seems to be more or less unaffected, but a noticeable hit occurs for me with 13B. Can't test any higher than that though, not enough RAM. I did part of my testing with --mlock to ensure the model wasn't going into swap and causing inconsistencies, and still got the same results.

@howard0su
Collaborator

How does the perf gain relate to the number of threads? I'm worried that memcpy is too cheap to scale to many cores. Maybe test with 2, 4, and 8 threads to see the difference.

@ggerganov
Member

Btw, it may be possible that multi-threading the memcpy calls does not give any significant benefit, so it might not be needed after all. But it's always good to investigate and make sure.

Maybe also try splitting the work by interleaving the threads.
I.e., instead of thread 0 processing rows [0, n/t) and thread 1 processing [n/t, 2n/t), etc., you can make thread 0 process rows [0, t, 2t, 3t, ...] and thread 1 process rows [1, t+1, 2t+1, ...], etc.
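
For illustration, here's a minimal standalone sketch of the two splitting strategies (nr, nth, and copy_row are illustrative placeholders, and the sequential outer loops stand in for the worker threads; this is not the actual ggml code):

#include <stdio.h>

enum { nr = 12, nth = 4 }; // example: 12 rows split across 4 threads

// stands in for copying one row of the tensor
static void copy_row(int ith, int ir) {
    printf("thread %d -> row %d\n", ith, ir);
}

int main(void) {
    // contiguous blocks: thread ith handles rows [ith*dr, min(ith*dr + dr, nr))
    for (int ith = 0; ith < nth; ++ith) {
        const int dr  = (nr + nth - 1)/nth;
        const int ir0 = dr*ith;
        const int ir1 = ir0 + dr < nr ? ir0 + dr : nr;
        for (int ir = ir0; ir < ir1; ++ir) {
            copy_row(ith, ir);
        }
    }

    // interleaved: thread 0 handles rows 0, nth, 2*nth, ...; thread 1 handles 1, nth+1, ...
    for (int ith = 0; ith < nth; ++ith) {
        for (int ir = ith; ir < nr; ir += nth) {
            copy_row(ith, ir);
        }
    }
    return 0;
}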

@rabidcopy
Contributor

Testing again on 13B.
Before:

llama_print_timings:        load time =  1589.07 ms
llama_print_timings:      sample time =    74.28 ms /   100 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   626.66 ms /     2 tokens (  313.33 ms per token)
llama_print_timings:        eval time = 26578.89 ms /    99 runs   (  268.47 ms per run)
llama_print_timings:       total time = 28243.52 ms

After:

llama_print_timings:        load time =  4835.10 ms
llama_print_timings:      sample time =    76.55 ms /   100 runs   (    0.77 ms per run)
llama_print_timings: prompt eval time =  3860.21 ms /     2 tokens ( 1930.11 ms per token)
llama_print_timings:        eval time = 26631.49 ms /    99 runs   (  269.00 ms per run)
llama_print_timings:       total time = 31544.44 ms

(I dunno why it keeps giving a higher prompt eval time.)

@ironman5366
Author

@rabidcopy you beat me to it! I was just about to post my benchmarking results. This might be a platform difference, or something I'm missing, but for me interleaving the threads seems to have done the trick. I wrote a little benchmark runner script (using mlock like you suggested) to do 10 runs each on 7B and 13B, and these are the outputs:

Before Multi-Threading:

Benchmarking 7B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 178.81 ms per run
Run 2/10 starting...
Run 2/: 183.2 ms per run
Run 3/10 starting...
Run 3/: 187.4 ms per run
Run 4/10 starting...
Run 4/: 177.27 ms per run
Run 5/10 starting...
Run 5/: 191.59 ms per run
Run 6/10 starting...
Run 6/: 181.69 ms per run
Run 7/10 starting...
Run 7/: 180.46 ms per run
Run 8/10 starting...
Run 8/: 184.72 ms per run
Run 9/10 starting...
Run 9/: 184.28 ms per run
Run 10/10 starting...
Run 10/: 185.46 ms per run
Min: 177.27 ms
Max: 191.59 ms
Avg: 183.488 ms
Benchmarking 13B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 483.07 ms per run
Run 2/10 starting...
Run 2/: 419.58 ms per run
Run 3/10 starting...
Run 3/: 394.25 ms per run
Run 4/10 starting...
Run 4/: 392.78 ms per run
Run 5/10 starting...
Run 5/: 385.48 ms per run
Run 6/10 starting...
Run 6/: 389.2 ms per run
Run 7/10 starting...
Run 7/: 390.36 ms per run
Run 8/10 starting...
Run 8/: 384.67 ms per run
Run 9/10 starting...
Run 9/: 372.03 ms per run
Run 10/10 starting...
Run 10/: 366.3 ms per run
Min: 366.3 ms
Max: 483.07 ms
Avg: 397.77200000000005 ms

After Multi-Threading:

Benchmarking 7B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 174.15 ms per run
Run 2/10 starting...
Run 2/: 180.09 ms per run
Run 3/10 starting...
Run 3/: 179.07 ms per run
Run 4/10 starting...
Run 4/: 182.59 ms per run
Run 5/10 starting...
Run 5/: 181.67 ms per run
Run 6/10 starting...
Run 6/: 181.78 ms per run
Run 7/10 starting...
Run 7/: 183.7 ms per run
Run 8/10 starting...
Run 8/: 183.88 ms per run
Run 9/10 starting...
Run 9/: 188.41 ms per run
Run 10/10 starting...
Run 10/: 186.34 ms per run
Min: 174.15 ms
Max: 188.41 ms
Avg: 182.16799999999998 ms
Benchmarking 13B/ggml-model-q4_0.bin
Run 1/10 starting...
Run 1/: 339.03 ms per run
Run 2/10 starting...
Run 2/: 351.25 ms per run
Run 3/10 starting...
Run 3/: 352.01 ms per run
Run 4/10 starting...
Run 4/: 356.98 ms per run
Run 5/10 starting...
Run 5/: 353.85 ms per run
Run 6/10 starting...
Run 6/: 357.22 ms per run
Run 7/10 starting...
Run 7/: 361.47 ms per run
Run 8/10 starting...
Run 8/: 362.29 ms per run
Run 9/10 starting...
Run 9/: 364.97 ms per run
Run 10/10 starting...
Run 10/: 366.02 ms per run
Min: 339.03 ms
Max: 366.02 ms
Avg: 356.50900000000007 ms

Regardless, the perf difference seems very minimal.

I think there are maybe some minor optimizations that could still be made here, but it would definitely be a case of diminishing returns.

I haven't had a chance to test with different thread counts as Howard suggested yet; I'm planning to try that tomorrow.

@rabidcopy
Contributor

Tested just on 13B.
Before:

Run 1/10 starting...
Run 1/: 277.26 ms per run
Run 2/10 starting...
Run 2/: 271.91 ms per run
Run 3/10 starting...
Run 3/: 267.1 ms per run
Run 4/10 starting...
Run 4/: 334.97 ms per run
Run 5/10 starting...
Run 5/: 276.55 ms per run
Run 6/10 starting...
Run 6/: 269.92 ms per run
Run 7/10 starting...
Run 7/: 274.91 ms per run
Run 8/10 starting...
Run 8/: 271.8 ms per run
Run 9/10 starting...
Run 9/: 274.23 ms per run
Run 10/10 starting...
Run 10/: 274.71 ms per run
Min: 267.1 ms
Max: 334.97 ms
Avg: 279.33600000000007 ms

After:

Run 1/10 starting...
Run 1/: 280.65 ms per run
Run 2/10 starting...
Run 2/: 274.67 ms per run
Run 3/10 starting...
Run 3/: 294.34 ms per run
Run 4/10 starting...
Run 4/: 286.44 ms per run
Run 5/10 starting...
Run 5/: 269.75 ms per run
Run 6/10 starting...
Run 6/: 276.93 ms per run
Run 7/10 starting...
Run 7/: 278.48 ms per run
Run 8/10 starting...
Run 8/: 276.63 ms per run
Run 9/10 starting...
Run 9/: 272.17 ms per run
Run 10/10 starting...
Run 10/: 290.9 ms per run
Min: 269.75 ms
Max: 294.34 ms
Avg: 280.096 ms

Still kind of hard to say, but at the very least there are no negative consequences now. So if it helps on other systems in some cases, that's good, I think.

Contributor

@Royalphax Royalphax left a comment

If I understand correctly, ggml_compute_forward_cpy is equivalent to ggml_compute_forward_dup; thus, multithreading the CPY operation means that the DUP operation is multithreaded too.

In that case, shouldn't line 9384 in ggml_graph_compute be modified this way?
From:

case GGML_OP_DUP:
{
    node->n_tasks = 1;
} break;

To:

case GGML_OP_DUP:
{
    node->n_tasks = n_threads;
} break;


*(float *) dst_ptr = GGML_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);
// Interleave execution so that in a 4 thread run thread 0 copies regions 0,4,8, ...
if ((region_index++ % total_threads) == thread_num) {
Contributor

@Fabio3rs Fabio3rs Apr 10, 2023

Is it possible to change how the loop counts, to achieve the same result without this branch?

Having the threads read/write contiguous memory regions can theoretically give better performance, because of how branch prediction and the memory-loading mechanisms (cache line size, for example) work.
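
As a minimal sketch of that suggestion (total_regions is a hypothetical name; thread_num and total_threads are the variables from the diff above), each thread would compute one contiguous [start, end) range up front instead of testing the modulo condition on every region:

// sketch: contiguous per-thread range, no per-region branch
static void copy_regions_contiguous(int thread_num, int total_threads, int total_regions) {
    const int per_thread = (total_regions + total_threads - 1)/total_threads;
    const int start = thread_num*per_thread;
    const int end   = start + per_thread < total_regions ? start + per_thread : total_regions;

    for (int i = start; i < end; ++i) {
        // copy region i here (e.g. one memcpy per region);
        // consecutive iterations touch adjacent memory
    }
}

Whether that actually helps here would still need benchmarking, since the copies may be memory-bandwidth bound either way.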

@ironman5366
Author

Closing this with the update in #782
