Multi-thread ggml_cpy()
#824
Conversation
Just wondering, do you get garbage output with your implementation? I tried implementing it similarly to how ggml_rope()'s multithreading was implemented, and ran with the following command and prompt:
Ran a comparison between master and this PR about 10 times on Linux with a Ryzen 2600.
After:
Seems to be consistently slower in my case. Text output remained the same. On a 7B model the difference is less noticeable and it's tough to say.
After:
Testing more on 13B, I'm still consistently getting times like 312ms on master versus 330ms/347ms/375ms on this PR.
Well, that's not what we want! Thanks for the additional context from 13B; I'd only been trying with 7B. I'll try some further optimizations and see if it can be salvaged.
Yeah, it's strange. 7B seems to be more or less unaffected, but a noticeable hit occurs for me with 13B. Can't test any higher than that though, not enough RAM. I did part of my testing with --mlock to ensure the model wasn't going into swap and causing inconsistencies, and still got the same results.
How is the perf gain related to the number of threads? I worry that memcpy is too cheap to scale across many cores. Maybe test with 2, 4, and 8 threads to see the difference.
Btw, it may be possible that multi-threading the copy isn't needed at all. Maybe also try splitting the work by interleaving the threads.
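The interleaving suggestion could be sketched roughly like this (a minimal standalone illustration, not the PR's actual code; the function name and signature are made up): thread `ith` of `nth` handles elements ith, ith+nth, ith+2*nth, and so on.

```c
#include <stddef.h>

// Hypothetical interleaved work split: thread `ith` of `nth`
// copies elements ith, ith + nth, ith + 2*nth, ...
static void copy_interleaved(float * dst, const float * src,
                             size_t n, int ith, int nth) {
    for (size_t i = (size_t) ith; i < n; i += (size_t) nth) {
        dst[i] = src[i];
    }
}
```

One trade-off of interleaving is that adjacent elements are touched by different threads, which is less cache-line-friendly than giving each thread a contiguous slice.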
Testing again on 13B.
After:
(I dunno why it keeps giving a higher prompt eval time.)
@rabidcopy you beat me to it! I was just about to post my benchmarking results. This might be a platform difference, or something I'm missing, but for me interleaving the threads seems to have done the trick. I wrote a little benchmark runner script (using mlock like you suggested) to do 10 runs each on 7B and 13B, and these are the outputs: Before Multi-Threading:
After Multi-Threading:
Regardless, the perf difference seems very minimal. There are maybe minor optimizations that could be done further here, but I think it would definitely be a case of diminishing returns. I haven't had a chance to test with different thread counts as Howard suggested yet; planning to try that tomorrow.
Tested just on 13B.
After:
Still kinda hard to say, but at the very least there are no negative consequences now. So if it helps on other systems in some cases, that's good, I think.
If I understand correctly, ggml_compute_forward_cpy is equivalent to ggml_compute_forward_dup; thus, multithreading the CPY operation means the DUP operation is multithreaded too. In that case, shouldn't line 9384 in ggml_graph_compute be modified this way?
From:

case GGML_OP_DUP:
    {
        node->n_tasks = 1;
    } break;

To:

case GGML_OP_DUP:
    {
        node->n_tasks = n_threads;
    } break;
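For context, raising n_tasks to n_threads only helps if the op body actually splits its work by thread index. A hedged sketch of that pattern (the struct and names here are simplified stand-ins for ggml's params->ith / params->nth, not the actual ggml code):

```c
#include <string.h>

// Simplified stand-in for ggml's per-thread compute params (illustration only).
struct compute_params { int ith; int nth; };

// Row-wise split: with n_tasks = n_threads, thread `ith` copies every nth row.
static void dup_rows(const struct compute_params * params,
                     float * dst, const float * src,
                     int nrows, int row_len) {
    for (int r = params->ith; r < nrows; r += params->nth) {
        memcpy(dst + (size_t) r * row_len,
               src + (size_t) r * row_len,
               (size_t) row_len * sizeof(float));
    }
}
```

With n_tasks = 1, only thread 0 would enter the loop and the other threads would sit idle for this node.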
*(float *) dst_ptr = GGML_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);
// Interleave execution so that in a 4 thread run thread 0 copies regions 0,4,8, ...
if ((region_index++ % total_threads) == thread_num) {
Is it possible to change how the loop counts to achieve the same result without this branch?
Making the threads read/write contiguous memory regions can theoretically perform better, because of how branch prediction and memory-loading mechanisms (cache line size, for example) work.
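The branch-free alternative being hinted at could look like this (a sketch under assumed names, not the reviewed code): give each thread one dense slice, so the per-element modulo and branch disappear and each thread streams through contiguous memory.

```c
#include <string.h>

// Hypothetical contiguous split: thread `ith` of `nth` copies one dense slice.
static void copy_contiguous(float * dst, const float * src,
                            size_t n, int ith, int nth) {
    const size_t per = (n + (size_t) nth - 1) / (size_t) nth; // ceil(n / nth)
    const size_t i0  = per * (size_t) ith;
    const size_t i1  = (i0 + per < n) ? i0 + per : n;
    if (i0 < n) {
        memcpy(dst + i0, src + i0, (i1 - i0) * sizeof(float));
    }
}
```

Each thread computes its own [i0, i1) range once up front, so the hot loop is a single memcpy over a contiguous region.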
Closing this with the update in #782.
This is an attempt at #782.
Apologies if this is the wrong approach; I'm still learning the codebase!
Tested on Mac i7-1068NG7. Change in performance seems to be very small, if any:
Before:
After:
(The GGML_PERF outputs and the print timings outputs are separate runs, all are the 2nd invocation after building, using the same command)
Additionally, I'm having a bit of trouble testing the f16 code path for this. I'm sure I'm doing something silly, but even when running with an f16 model, I never see src0->type equal GGML_TYPE_F16. Any tips for building?
Thanks!