ggml : use dynamic thread scheduling for matrix multiplication #6915
Conversation
I'm not sure how likely this is to be merged, but it's faster: master:
pr:
Greater than 2x speed increase. Here are my arguments: Built with |
Judging by @Jeximo's numbers, this is a substantial improvement. But it surprises me that even after this change, you're getting a PP performance range of That's a huge gap. I can't reproduce this. And in your case, you were getting the max and min performance multiple times! With 6 threads, this doesn't seem explicable just by OS scheduling quirks. Perhaps it makes sense to wrap the invocation in something like Hyperfine, which will keep running the benchmark until the statistics stabilize. |
Probably related: #1507 |
@Jeximo I couldn't see an Android performance boost. Mind sharing your arguments? If -tb is set manually to greater than 4, e.g. 5, the speed dramatically decreases from 22 t/s to 7 t/s, testing TinyLlama |
Yeah, I agree something is up; I just haven't figured out what it could be, though. And the |
With PR: Without: |
Just a note... for my testing, I've disabled all BLASs, including Llamafile. |
It was all Windows scheduling. For testing, I threw in some basic affinity code (the n*2 is because I have HT enabled, and I want it on different physical cores).
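A rough sketch of what such affinity code might look like (hypothetical; the use of SetThreadAffinityMask and the exact mask math are assumptions based on the description above, not the code from the branch):

```c
#include <windows.h>

// Pin worker n to its own physical core on a hyper-threaded Windows box:
// logical processors 2*n and 2*n+1 share a physical core, so masking to
// 1 << (n * 2) keeps each worker on a different physical core.
static void pin_worker_to_physical_core(int n) {
    const DWORD_PTR mask = (DWORD_PTR)1 << (n * 2);
    SetThreadAffinityMask(GetCurrentThread(), mask);
}
```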
And I'm getting tight timing results on Windows now, with a 4% margin of error on PP, and sub-1% on TG. I also added timing to
@slaren If I write something like my code, but without the janky "Just divide by 16...", where it properly partitions it like #1507 does, would that probably get merged in? (Also, thanks for the link to 1507.) If so, I'll cook that up over the weekend. |
I couldn't sleep, so I coded the idea up. |
One confounding factor to take into account is the behavior of E-cores vs P-cores. On newer systems a thread running on an E-core can be significantly slower than one running on a P-core, causing a potential bottleneck not present in older systems or systems with different scheduling behavior. Thread scheduling being OS-dependent and unpredictable may make comparisons tricky; it might be worth disabling efficiency cores first to compare the improvement. Maybe Related: |
I'm just running a Ryzen 7600, so I don't have the E-core/P-core situation, but the PR should address that issue as a side effect. The updated PR divides most MMs of a 7B model into 10000+ chunks, and threads that were previously finishing up to 4000us apart (even with prioritization that I hacked in for testing) are now finishing within about 20us of each other. And this should be a pretty clean solution that's cross-platform: instead of assuming all threads will get the same amount of execution time, just divide the larger MM into small blocks and execute the smaller blocks. |
I removed the global, moving it to shared state, but that required moving a bit of code around to allow that. Next I'll go through and fix the build errors; it's just unused variables in some configs that I need to address. Maybe we should move the thread configuration (atomic_store, ggml_thread_create, set_numa_thread_affinity, etc.) to its own file? I also don't like that I had to pass the ggml_compute_state through ggml_compute_forward, but I don't see a better way of doing it. It kinda seems right to have it there, because everything needs arguments like ith and nth that originate on the state. |
6eb46e2 to 54c2460 (Compare)
@slaren If you have a chance, could you do a review of this PR? I think it might be ready. It produces stable timing results, along with a mild speed gain; see after2.txt, and the worst-case results have almost all gone away. You might want to look at my change to
Since GitHub isn't showing the diff very well, the only thing changed in the moved code is adding current_chunk to ggml_compute_state_shared. I realized I could have avoided most of the moves by declaring functions ahead of time; hindsight is 20/20. If you'd prefer that, LMK, and I'll figure out how to return them to their original locations.
I also thought about why we may be seeing a difference in speed reproducibility. I'm on Windows, so my set_numa_thread_affinity is an empty function. I did some test runs where I did a hacky job of setting my thread affinity & thread priority, and that helped stabilize the timings as well. Do you think that might explain why PEW and I had different experiences? |
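For reference, the shared-state addition described here amounts to roughly the following sketch (the struct and field names come from the comment above; the other fields are elided, and atomic_int is an assumption about the exact type used):

```c
#include <stdatomic.h>

// Sketch of the described change: ggml_compute_state_shared gains a shared
// chunk counter that all worker threads can see.
struct ggml_compute_state_shared {
    // ... existing fields ...
    atomic_int current_chunk; // index of the next mul_mat chunk a worker should claim
};
```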
Now the PR seems equal to master for my device (fresh clone & patch).
No BLAS, and disabled llamafile:
Yes, likely my result is not easily reproducible as I cannot control thread affinity on Android. |
@Jeximo You may want to try doubling the number of threads and see if that helps. The PR should handle poor OS scheduling better than master. In theory you should get the best performance when the number of threads == the number of logical processors, but neither master nor this PR hits that number on Windows. I'm still getting better performance when the number of threads == the number of cores, but maybe you'll get something different. |
Ah, yeah master |
It would be very useful if the order were restored so that the diff only shows the changes; it's hard to review otherwise. |
Hello, here's a bit of feedback from a larger machine (two-socket, 64 cores) |
Done
I'll ponder that case. Is the speed drop in PP or TG? How did you measure the change? Can you post llama-bench results? What sized model? If you want to really dig into it, compile it with GGML_PERF and find the section of the code that says |
What NUMA flags are you using? For NUMA you probably want the same processors to always access the same fraction of the weights, so that they can be moved to the local memory of the node. |
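To illustrate the contrast being pointed out (a sketch only, not ggml's actual code): with fixed scheduling, thread ith of nth always owns the same contiguous row range, so after first touch those weight pages can stay in that thread's local NUMA node memory, whereas dynamic chunking lets any thread touch any row.

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Fixed-partitioning sketch: thread `ith` of `nth` always processes the same
// contiguous block of rows, so repeated runs touch the same weight memory
// from the same NUMA node.
static void process_rows_fixed(int ith, int nth, int64_t nrows) {
    const int64_t rows_per_thread = (nrows + nth - 1) / nth;
    const int64_t ir0 = rows_per_thread * ith;
    const int64_t ir1 = MIN(ir0 + rows_per_thread, nrows);
    for (int64_t ir = ir0; ir < ir1; ++ir) {
        // ... work on row ir of the weight matrix ...
        (void) ir;
    }
}
```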
@cpumaxx Also, would you mind testing this branch and seeing if it's the same/better/worse than master with NUMA on? It'll let me know whether it's the locking implementation or the different scheduling that's the problem. |
13900k under WSL2:
So it's a very nice speedup for a CPU with E-cores. I didn't expect that it would also improve generation performance with 8 threads (= number of P-cores); I guess the OS scheduler is not doing a very good job here. Perplexity also looks good. I will try under native Windows as well later. I think it is important to keep the option of using the previous fixed scheduling though; I expect it will be especially important for NUMA systems, and it may also help on systems with only one type of CPU core. |
@slaren Yeah, I'm doing all my testing under Windows. The speedup is similar there. I think even with a CPU where all the cores are the same, it's an across-the-board speedup. OSes don't guarantee perfect scheduling, so a blockwise approach like the PR's should help in almost any case. Even on a dedicated machine, the OS will still handle interrupts on one processor, so you can't assume it will have 100% of all processors. I think the only case where this won't help is a single-threaded environment. Plugging in some reasonable worst-case numbers, this is one atomic operation per 16x1 block (the size of one block of a token-generation vec * matrix) x 4096 (the width of many of the matrices in a 7B-param model), so one atomic per ~64k FP operations. That's pretty low. The metrics I have don't show the atomic as being a notable part of the overhead. I think there's still an issue with the PR re cpumaxx's problem, but I may have a solution to it. I'll want to see what he reports back to my questions. I'd like to totally get rid of the fixed scheduler; I think the automatic scheduler should be able to do all the work. It should be speed-neutral or better even with NUMA enabled. I'm wondering if the slowdowns he's seeing aren't scheduler-related, but instead related to the changes I made to |
13900k with MSVC
Similar speedup, but worse absolute performance. |
@slaren Would you mind trying https://github.com/kunnis/llama.cpp/tree/MMThreadingPerfChangeWithMmPause as an alternative with your setup? It's a lighter-weight change to ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task. It's not as fast for me as the PR version, but it's close. Maybe it's a better compromise between all the use cases. |
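For context, _mm_pause emits the x86 PAUSE hint that is typically inserted into a spin-wait loop. A generic sketch of the pattern (this is not the contents of that branch, just an illustration under that assumption):

```c
#include <immintrin.h>
#include <stdatomic.h>

// Generic spin-wait sketch: block until the shared counter reaches `target`,
// issuing a PAUSE hint on each iteration so the busy-wait is cheaper on the
// core and friendlier to a hyper-threaded sibling.
static void spin_wait_until(atomic_int * counter, int target) {
    while (atomic_load_explicit(counter, memory_order_acquire) < target) {
        _mm_pause();
    }
}
```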
@calculatortamer Slightly off-topic - does enabling flash-attention ( |
@slaren Yeah. Is there a formatter I need to run or something? |
The formatting needs to be done manually. Automatic formatters usually remove the vertical alignment, so they are not used. |
@ggerganov I did some speed tests:
./llama-bench -fa 0,1 -t 4,8 -r 2 -m Meta-Llama-3-8B-Instruct-IQ4_XS.gguf
build: bd80601 (2844)
Conclusion: TG same within margin of error, smaller PP (~ -5%).
./llama-bench -fa 0,1 -t 4,8 -r 5 -m ../llama.cpp/models/TinyLlama-1.1B-intermediate-step-1431k-3T.IQ4_XS.gguf,../llama.cpp/models/tinyllama-1.1b-intermediate-step-1431k-3t.Q4_K_S.gguf
Conclusion: smaller TG (equal at best), smaller PP.
./llama-bench -fa 0,1 -t 4,8 -r 3 -m ../llama.cpp/models/Phi-3-mini-4k-instruct.IQ4_XS.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_K_S.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf
build: bd80601 (2844)
Conclusion for IQ4_XS: better TG all of a sudden??? at the cost of a smaller PP. Maybe Phi-3 mini is the perfect balance for my Orange Pi?
Rerun of the IQ4_XS Phi-3 model, but with -r 10:
build: bd80601 (2844) ... never mind.
Overall: no performance improvement; FA always made the PP smaller and the TG smaller or equal at best. Also, in Phi-3 Q4_0, PP increased @ 8 cores, but it's nothing when considering the quant's horrible performance compared to 4 cores.
TL;DR: no, it only reduces the PP. I hope you find something interesting. |
Sure. I see several spots where my IDE changed spacing around asterisks. I was watching for indenting. I'll go through the diff and see if I spot any other whitespace changes, but is there anything else I should watch out for? |
Some examples
Thanks. I'll knock those out. |
I think I fixed the style. I also found a bit of Windows debug code I'd left in, and removed it. There's a patch that does a better job of something similar that's waiting to be merged in. |
@slaren What are your thoughts on my next PR being about moving everything related to NUMA, thread configuration, barriers, atomics, and pthreads into a file named ggml-threading.h? It should be a lot simpler. Is there a better place to discuss this? |
I think that would be good, but you should check with @ggerganov. I think here is fine, but you can also start an issue or discussion if you want more broad feedback. |
There are a few warnings when building without llamafile:
|
I'll update the PR to address those. I thought I'd checked that case. |
Yes, it's OK to add |
…org#6915)

* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to it's own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to setup the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration.
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c

---------
ggml-org#6915)" This reverts commit e1b40ac.
This is a draft of an idea I want feedback/thoughts/suggestions on.
I was working on a different issue, and I was having problems getting stable timing results. I ran llama-bench and was getting pretty different results with each run. See before.txt: PP timings varied between 21.46 and 32.42 t/s. TG was between 11.59 and 12.86 t/s.
So I looked at how the threading works in ggml_compute_forward_mul_mat. Since work is pre-dispatched to the threads, each thread will always do the same amount of work and won't adapt; if one thread is stalled by the OS, that'll delay the whole operation. So what I did was break the work into 16*number_of_threads slices and have the threads work through the slices. I pulled all the guts of the operation out into its own function, and used the existing logic to divide it up into 16x more chunks. Each thread then grabs the ID of the next chunk of work to do. This gave more stable and faster timing results.
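A minimal sketch of that chunk-grabbing loop (illustrative only; the function name and row math are assumptions rather than the actual PR diff — only the shared atomic chunk counter and the 16*number_of_threads chunk count mirror the description above):

```c
#include <stdatomic.h>
#include <stdint.h>

// Each worker repeatedly claims the next unprocessed chunk from a shared
// atomic counter instead of being handed a fixed slice up front, so if the OS
// stalls one thread the remaining chunks are absorbed by the others.
// n_chunks would be something like 16 * number_of_threads.
static void mm_worker(atomic_int * current_chunk, int n_chunks, int64_t nrows) {
    const int64_t rows_per_chunk = (nrows + n_chunks - 1) / n_chunks;
    for (;;) {
        const int chunk = atomic_fetch_add(current_chunk, 1);
        if (chunk >= n_chunks) {
            break; // every chunk has been claimed
        }
        const int64_t ir0 = (int64_t) chunk * rows_per_chunk;
        const int64_t ir1 = ir0 + rows_per_chunk < nrows ? ir0 + rows_per_chunk : nrows;
        for (int64_t ir = ir0; ir < ir1; ++ir) {
            // ... compute output row ir of the matrix multiplication ...
            (void) ir;
        }
    }
}
```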
after.txt: PP between 20.83 and 35.12 t/s, and TG between 12.82 and 13.47 t/s.
So the range of the average went from 1.27 to 0.65, along with the speed going up. Do you have a better metric I should try for measuring this? Maybe I need to do llama-bench with -r 100 or something silly and let it run overnight, and compare the two results? I'm trying to improve my testing process so I'm more sure about the results.
What are your thoughts on the general idea of a change like this? I don't like the global I had to add, but I didn't want to add it to ggml_compute_params because I'd have to add atomic.h to the header... but maybe that's okay? I'll do more tests, and test it on Linux before posting this as a PR, plus I wanna fix that global, get better timing comparisons, etc, etc.