ggml : use dynamic thread scheduling for matrix multiplication #6915
Conversation
I'm not sure how likely this is to be merged, but it's faster: master:
pr:
Greater than 2x speed increase. Here are my arguments: Built with |
Judging by @Jeximo's numbers, this is a substantial improvement. But it surprises me that even after this change, you're getting a PP performance range of That's a huge gap. I can't reproduce this. And in your case, you were getting the max and min performance multiple times! With 6 threads, this doesn't seem explicable just by OS scheduling quirks. Perhaps it makes sense to wrap the invocation in something like Hyperfine, which will keep running the benchmark until the statistics stabilize. |
Probably related: #1507 |
@Jeximo I couldn't see an Android performance boost. Mind sharing your arguments? If -tb is set manually to greater than 4, e.g. 5, the speed dramatically decreases from 22 t/s to 7 t/s, testing TinyLlama |
Yeah, I agree something is up; I just haven't figured out what it could be, though. And the |
With PR: Without: |
Just a note... for my testing, I've disabled all BLASs, including Llamafile. |
It was all Windows scheduling. For testing, I threw in some basic affinity code (the n*2 is because I have HT enabled, and I want it on different physical cores).
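A rough sketch of what such affinity code might look like (hypothetical; the use of SetThreadAffinityMask and the exact mask math are assumptions based on the description above, not the code from the branch):

```c
#include <windows.h>

// Pin worker n to its own physical core on a hyper-threaded Windows box:
// logical processors 2*n and 2*n+1 share a physical core, so masking to
// 1 << (n * 2) keeps each worker on a different physical core.
static void pin_worker_to_physical_core(int n) {
    const DWORD_PTR mask = (DWORD_PTR)1 << (n * 2);
    SetThreadAffinityMask(GetCurrentThread(), mask);
}
```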
And I'm getting tight timing results on Windows now, with a 4% margin of error on PP, and sub-1% on TG. I also added timing to
@slaren If I write something like my code, but without the janky "Just divide by 16...", where it properly partitions it like #1507 does, would that probably get merged in? (Also, thanks for the link to 1507.) If so, I'll cook that up over the weekend. |
I couldn't sleep, so I coded the idea up. |
One confounding factor to take into account is the behavior of E-cores vs P-cores. On newer systems a thread running on an E-core can be significantly slower than one running on a P-core, causing a potential bottleneck not present in older systems or systems with different scheduling behavior. Thread scheduling being OS-dependent and unpredictable may make comparisons tricky; it might be worth disabling efficiency cores first to compare the improvement. Maybe Related: |
I'm just running a Ryzen 7600, so I don't have the E-core/P-core situation, but the PR should address that issue as a side effect. The updated PR divides most MMs of a 7B model into 10000+ chunks, and threads that were previously finishing up to 4000us apart (even with prioritization that I hacked in for testing) are now finishing within about 20us of each other. And this should be a pretty clean solution that's cross-platform: instead of assuming all threads will get the same amount of execution time, just divide the larger MM into small blocks and execute the smaller blocks. |
I removed the global, moving it to shared state, but that required moving a bit of code around to allow that. Next I'll go through and fix the build errors; it's just unused variables in some configs that I need to address. Maybe we should move the thread configuration (atomic_store, ggml_thread_create, set_numa_thread_affinity, etc.) to its own file? I also don't like that I had to pass the ggml_compute_state through ggml_compute_forward, but I don't see a better way of doing it. It kinda seems right to have it there, because everything needs arguments like ith and nth that originate on the state. |
6eb46e2 to 54c2460 (Compare)
@slaren If you have a chance, could you do a review of this PR? I think it might be ready. It produces stable timing results, along with a mild speed gain; see after2.txt, and the worst-case results have almost all gone away. You might want to look at my change to
Since GitHub isn't showing the diff very well, the only thing changed in the moved code is adding current_chunk to ggml_compute_state_shared. I realized I could have avoided most of the moves by declaring functions ahead of time; hindsight is 20/20. If you'd prefer that, LMK, and I'll figure out how to return them to their original locations.
I also thought about why we may be seeing a difference in speed reproducibility. I'm on Windows, so my set_numa_thread_affinity is an empty function. I did some test runs where I did a hacky job of setting my thread affinity & thread priority, and that helped stabilize the timings as well. Do you think that might explain why PEW and I had different experiences? |
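For reference, the shared-state addition described here amounts to roughly the following sketch (the struct and field names come from the comment above; the other fields are elided, and atomic_int is an assumption about the exact type used):

```c
#include <stdatomic.h>

// Sketch of the described change: ggml_compute_state_shared gains a shared
// chunk counter that all worker threads can see.
struct ggml_compute_state_shared {
    // ... existing fields ...
    atomic_int current_chunk; // index of the next mul_mat chunk a worker should claim
};
```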
Now the PR seems equal to master for my device (fresh clone & patch).
No BLAS, and disabled llamafile:
Yes, likely my result is not easily reproducible as I cannot control thread affinity on Android. |
@Jeximo You may want to try doubling the number of threads and see if that helps. The PR should handle poor OS scheduling better than master. In theory you should get the best performance when the number of threads == the number of logical processors, but neither master nor this PR hits that number on Windows. I'm still getting better performance when the number of threads == the number of cores, but maybe you'll get something different. |
Ah, yeah master |
It would be very useful if the order were restored so that the diff only shows the changes; it's hard to review otherwise. |
Hello, here's a bit of feedback from a larger machine (two-socket, 64 cores) |
Done
I'll ponder that case. Is the speed drop in PP or TG? How did you measure the change? Can you post llama-bench results? What sized model? If you want to really dig into it, compile it with GGML_PERF and find the section of the code that says |
What NUMA flags are you using? For NUMA you probably want the same processors to always access the same fraction of the weights, so that they can be moved to the local memory of the node. |
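To illustrate the contrast being pointed out (a sketch only, not ggml's actual code): with fixed scheduling, thread ith of nth always owns the same contiguous row range, so after first touch those weight pages can stay in that thread's local NUMA node memory, whereas dynamic chunking lets any thread touch any row.

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Fixed-partitioning sketch: thread `ith` of `nth` always processes the same
// contiguous block of rows, so repeated runs touch the same weight memory
// from the same NUMA node.
static void process_rows_fixed(int ith, int nth, int64_t nrows) {
    const int64_t rows_per_thread = (nrows + nth - 1) / nth;
    const int64_t ir0 = rows_per_thread * ith;
    const int64_t ir1 = MIN(ir0 + rows_per_thread, nrows);
    for (int64_t ir = ir0; ir < ir1; ++ir) {
        // ... work on row ir of the weight matrix ...
        (void) ir;
    }
}
```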
@cpumaxx Also, would you mind testing this branch and seeing if it's the same/better/worse than master with NUMA on? It'll let me know whether it's the locking implementation or the different scheduling that's the problem. |
13900k under WSL2:
So it's a very nice speedup for a CPU with E-cores. I didn't expect that it would also improve generation performance with 8 threads (= number of P-cores); I guess the OS scheduler is not doing a very good job here. Perplexity also looks good. I will try under native Windows as well later. I think it is important to keep the option of using the previous fixed scheduling though; I expect it will be especially important for NUMA systems, and it may also help on systems with only one type of CPU core. |
@slaren Yeah, I'm doing all my testing under Windows. The speedup is similar there. I think even with a CPU where all the cores are the same, it's an across-the-board speedup. OSes don't guarantee perfect scheduling, so a blockwise approach like the PR's should help in almost any case. Even on a dedicated machine, the OS will still handle interrupts on one processor, so you can't assume it will have 100% of all processors. I think the only case where this won't help is a single-threaded environment. Plugging in some reasonable worst-case numbers, this is one atomic operation per 16x1 block (the size of one block of a token-generation vec * matrix) x 4096 (the width of many of the matrices in a 7B-param model), so one atomic per ~64k FP operations. That's pretty low. The metrics I have don't show the atomic as being a notable part of the overhead. I think there's still an issue with the PR re cpumaxx's problem, but I may have a solution to it. I'll want to see what he reports back to my questions. I'd like to totally get rid of the fixed scheduler; I think the automatic scheduler should be able to do all the work. It should be speed-neutral or better even with NUMA enabled. I'm wondering if the slowdowns he's seeing aren't scheduler-related, but instead related to the changes I made to |
13900k with MSVC
Similar speedup, but worse absolute performance. |
@slaren Would you mind trying https://github.com/kunnis/llama.cpp/tree/MMThreadingPerfChangeWithMmPause as an alternative with your setup? It's a lighter-weight change to ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task. It's not as fast for me as the PR version, but it's close. Maybe it's a better compromise between all the use cases. |
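For context, _mm_pause emits the x86 PAUSE hint that is typically inserted into a spin-wait loop. A generic sketch of the pattern (this is not the contents of that branch, just an illustration under that assumption):

```c
#include <immintrin.h>
#include <stdatomic.h>

// Generic spin-wait sketch: block until the shared counter reaches `target`,
// issuing a PAUSE hint on each iteration so the busy-wait is cheaper on the
// core and friendlier to a hyper-threaded sibling.
static void spin_wait_until(atomic_int * counter, int target) {
    while (atomic_load_explicit(counter, memory_order_acquire) < target) {
        _mm_pause();
    }
}
```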
@calculatortamer Slightly off-topic - does enabling flash-attention ( |
@slaren Yeah. Is there a formatter I need to run or something? |
The formatting needs to be done manually. Automatic formatters usually remove the vertical alignment, so they are not used. |
@ggerganov I did some speed tests:
./llama-bench -fa 0,1 -t 4,8 -r 2 -m Meta-Llama-3-8B-Instruct-IQ4_XS.gguf
build: bd80601 (2844)
Conclusion: TG same within margin of error, smaller PP (~ -5%).
./llama-bench -fa 0,1 -t 4,8 -r 5 -m ../llama.cpp/models/TinyLlama-1.1B-intermediate-step-1431k-3T.IQ4_XS.gguf,../llama.cpp/models/tinyllama-1.1b-intermediate-step-1431k-3t.Q4_K_S.gguf
Conclusion: smaller TG (equal at best), smaller PP.
./llama-bench -fa 0,1 -t 4,8 -r 3 -m ../llama.cpp/models/Phi-3-mini-4k-instruct.IQ4_XS.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_K_S.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf
build: bd80601 (2844)
Conclusion for IQ4_XS: better TG all of a sudden??? at the cost of a smaller PP. Maybe Phi-3 mini is the perfect balance for my Orange Pi?
Rerun of the IQ4_XS Phi-3 model, but with -r 10:
build: bd80601 (2844) ... never mind.
Overall: no performance improvement; FA always made the PP smaller and the TG smaller or equal at best. Also, in Phi-3 Q4_0, PP increased @ 8 cores, but it's nothing when considering the quant's horrible performance compared to 4 cores.
TL;DR: no, it only reduces the PP. I hope you find something interesting. |
Sure. I see several spots where my IDE changed spacing around asterisks. I was watching for indenting. I'll go through the diff and see if I spot any other whitespace changes, but is there anything else I should watch out for? |
Some examples
Thanks. I'll knock those out. |
I think I fixed the style. I also found a bit of Windows debug code I'd left in, and removed it. There's a patch that does a better job of something similar that's waiting to be merged in. |
@slaren What are your thoughts on my next PR being about moving everything related to NUMA, thread configuration, barriers, atomics, and pthreads into a file named ggml-threading.h? It should be a lot simpler. Is there a better place to discuss this? |
I think that would be good, but you should check with @ggerganov. I think here is fine, but you can also start an issue or discussion if you want more broad feedback. |
There are a few warnings when building without llamafile:
|
I'll update the PR to address those. I thought I'd checked that case. |
Yes, it's OK to add |
…org#6915)

* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to it's own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to setup the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration.
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c

---------
ggml-org#6915)" This reverts commit e1b40ac.
This is a draft of an idea I want feedback/thoughts/suggestions on.
I was working on a different issue, and I was having problems getting stable timing results. I ran llama-bench and was getting pretty different results with each run. See before.txt: PP timings varied between 21.46 and 32.42 t/s. TG was between 11.59 and 12.86 t/s.
So I looked at how the threading works in ggml_compute_forward_mul_mat. Since work is pre-dispatched to the threads, each thread will always do the same amount of work and won't adapt; if one thread is stalled by the OS, that'll delay the whole operation. So what I did was break the work into 16*number_of_threads slices and have the threads work through the slices. I pulled all the guts of the operation out into its own function, and used the existing logic to divide it up into 16x more chunks. Each thread then grabs the ID of the next chunk of work to do. This gave more stable and faster timing results.
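A minimal sketch of that chunk-grabbing loop (illustrative only; the function name and row math are assumptions rather than the actual PR diff — only the shared atomic chunk counter and the 16*number_of_threads chunk count mirror the description above):

```c
#include <stdatomic.h>
#include <stdint.h>

// Each worker repeatedly claims the next unprocessed chunk from a shared
// atomic counter instead of being handed a fixed slice up front, so if the OS
// stalls one thread the remaining chunks are absorbed by the others.
// n_chunks would be something like 16 * number_of_threads.
static void mm_worker(atomic_int * current_chunk, int n_chunks, int64_t nrows) {
    const int64_t rows_per_chunk = (nrows + n_chunks - 1) / n_chunks;
    for (;;) {
        const int chunk = atomic_fetch_add(current_chunk, 1);
        if (chunk >= n_chunks) {
            break; // every chunk has been claimed
        }
        const int64_t ir0 = (int64_t) chunk * rows_per_chunk;
        const int64_t ir1 = ir0 + rows_per_chunk < nrows ? ir0 + rows_per_chunk : nrows;
        for (int64_t ir = ir0; ir < ir1; ++ir) {
            // ... compute output row ir of the matrix multiplication ...
            (void) ir;
        }
    }
}
```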
after.txt: PP between 20.83 and 35.12 t/s, and TG between 12.82 and 13.47 t/s.
So the range of the average went from 1.27 to 0.65, along with the speed going up. Do you have a better metric I should try for measuring this? Maybe I need to do llama-bench with -r 100 or something silly and let it run overnight, and compare the two results? I'm trying to improve my testing process so I'm more sure about the results.
What are your thoughts on the general idea of a change like this? I don't like the global I had to add, but I didn't want to add it to ggml_compute_params because I'd have to add atomic.h to the header... but maybe that's okay? I'll do more tests, and test it on Linux before posting this as a PR, plus I wanna fix that global, get better timing comparisons, etc, etc.