
ggml : use dynamic thread scheduling for matrix multiplication #6915

Merged: 27 commits merged into ggml-org:master from kunnis:MMThreadingPerfChange on May 15, 2024

Conversation

@kunnis (Contributor) commented Apr 26, 2024

This is a draft of an idea I want feedback/thoughts/suggestions on.

I was working on a different issue and was having problems getting stable timing results. I ran llama-bench and was getting pretty different results with each run. See before.txt: PP timings varied between 21.46 and 32.42 t/s; TG was between 11.59 and 12.86 t/s.

So I looked at how the threading works in ggml_compute_forward_mul_mat. Since work is pre-dispatched to the threads, each thread always does the same amount of work and cannot adapt; if one thread is stalled by the OS, the whole operation is delayed.
So what I did was break the work into 16 * number_of_threads slices and have the threads work through the slices. I pulled the guts of the operation out into its own function and reused the existing logic that divides up the work, just into 16x more chunks. Each thread then grabs the ID of the next chunk of work to do. This gave more stable and faster timing results.
See after.txt: PP between 20.83 and 35.12 t/s, TG between 12.82 and 13.47 t/s.

So the range of the averages went from 1.27 down to 0.65, and the speed also went up. Do you have a better metric I should use for measuring this? Maybe I need to run llama-bench with -r 100 or something silly, let it run overnight, and compare the two results? I'm trying to improve my testing process so I'm more confident in the results.

What are your thoughts on the general idea of a change like this? I don't like the global I had to add, but I didn't want to put it in ggml_compute_params because that would mean including atomic.h in the header... but maybe that's okay? I'll do more tests (including on Linux) before posting this as a PR, plus I want to fix that global, get better timing comparisons, and so on. A minimal sketch of the scheduling idea follows.
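To make the idea concrete, here is a minimal, self-contained sketch of the chunk-grabbing loop described above. The names (sched_state, worker, CHUNKS_PER_THREAD) and the single-threaded driver are illustrative assumptions, not the actual ggml code:

#include <stdatomic.h>
#include <stdio.h>

#define CHUNKS_PER_THREAD 16

typedef struct {
    atomic_int current_chunk; // next chunk id to hand out
    int        n_chunks;      // total number of chunks (16 * n_threads)
    int        n_rows;        // rows of the output matrix to cover
} sched_state;

// Every worker runs this loop: grab the next unclaimed chunk id from the
// shared atomic counter, process it, repeat. A thread stalled by the OS only
// delays its current chunk instead of a fixed 1/n_threads slice of the work.
static void worker(sched_state * st) {
    const int rows_per_chunk = (st->n_rows + st->n_chunks - 1) / st->n_chunks;
    for (;;) {
        const int chunk = atomic_fetch_add(&st->current_chunk, 1);
        if (chunk >= st->n_chunks) {
            break; // all chunks claimed
        }
        const int ir0 = chunk * rows_per_chunk;
        int       ir1 = ir0 + rows_per_chunk;
        if (ir1 > st->n_rows) ir1 = st->n_rows;
        // ... compute output rows [ir0, ir1) of the matmul here ...
        printf("chunk %3d -> rows [%5d, %5d)\n", chunk, ir0, ir1);
    }
}

int main(void) {
    sched_state st;
    atomic_init(&st.current_chunk, 0);
    st.n_chunks = CHUNKS_PER_THREAD * 6; // e.g. 6 worker threads
    st.n_rows   = 4096;
    worker(&st);                         // single-threaded demo run
    return 0;
}

With a real thread pool, each worker would run this loop concurrently; atomic_fetch_add hands out every chunk id exactly once, so a stalled thread only delays the chunk it currently holds.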

github-actions bot commented Apr 26, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 536 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8723.78ms p(95)=21786.6ms fails=, finish reason: stop=469 truncated=67
  • Prompt processing (pp): avg=104.86tk/s p(95)=475.83tk/s
  • Token generation (tg): avg=35.74tk/s p(95)=44.8tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=MMThreadingPerfChange commit=14c104d1668716e1049fb99df121e9d60efb1926

prompt_tokens_seconds [chart omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 536 iterations; y-axis: llamacpp:prompt_tokens_seconds]

predicted_tokens_seconds [chart omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 536 iterations; y-axis: llamacpp:predicted_tokens_seconds]

Details

kv_cache_usage_ratio [chart omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 536 iterations; y-axis: llamacpp:kv_cache_usage_ratio]

requests_processing [chart omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 536 iterations; y-axis: llamacpp:requests_processing]

@Jeximo (Contributor) commented Apr 26, 2024

uname -a
Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android

I'm not sure how likely this is to be merged, but it's faster:

master:

llama_print_timings:        load time =   10137.45 ms
llama_print_timings:      sample time =      30.83 ms /   159 runs   (    0.19 ms per token,  5156.98 tokens per second)
llama_print_timings: prompt eval time =   41829.56 ms /    59 tokens (  708.98 ms per token,     1.41 tokens per second)
llama_print_timings:        eval time =   59630.39 ms /   158 runs   (  377.41 ms per token,     2.65 tokens per second)
llama_print_timings:       total time =  106957.99 ms /   217 tokens

pr:

llama_print_timings:        load time =    9489.74 ms                                              
llama_print_timings:      sample time =      29.13 ms /   159 runs   (    0.18 ms per token,  5458.48 tokens per second)
llama_print_timings: prompt eval time =   18258.65 ms /    59 tokens (  309.47 ms per token,     3.23 tokens per second)
llama_print_timings:        eval time =   55050.81 ms /   158 runs   (  348.42 ms per token,     2.87 tokens per second)
llama_print_timings:       total time =   75499.96 ms /   217 tokens

Greater than 2x speed increase in prompt eval time with a 7B IQ4_XS model on CPU.

Here are the arguments:
./main -m ~/WaveCoder-Ultra-6.7b.IQ4_XS.gguf -ins --interactive-first --penalize-nl --in-suffix "### Response:" --temp 0 -c 2048 -t 4 -b 10 -p "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."

Built with cmake -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm

@p-e-w commented Apr 26, 2024

Judging by @Jeximo's numbers, this is a substantial improvement. But it surprises me that even after this change, you're getting a PP performance range of 20.83-35.12 t/s.

That's a huge gap. I can't reproduce this. And in your case, you were getting the max and min performance multiple times! With 6 threads, this doesn't seem explicable just by OS scheduling quirks.

Perhaps it makes sense to wrap the invocation in something like Hyperfine, which will keep running the benchmark until the statistics stabilize.

@slaren (Member) commented Apr 26, 2024

Probably related: #1507

@BarfingLemurs (Contributor) commented:

@Jeximo I couldn't see an Android performance boost. Mind sharing your arguments? If -tb is set manually to greater than 4, e.g. 5, the speed dramatically decreases from 22 t/s to 7 t/s (testing TinyLlama).

@kunnis (Contributor, Author) commented Apr 26, 2024

Judging by @Jeximo's numbers, this is a substantial improvement. But it surprises me that even after this change, you're getting a PP performance range of 20.83-35.12 t/s.

That's a huge gap. I can't reproduce this. And in your case, you were getting the max and min performance multiple times! With 6 threads, this doesn't seem explicable just by OS scheduling quirks.

Perhaps it makes sense to wrap the invocation in something like Hyperfine, which will keep running the benchmark until the statistics stabilize.

Yeah, I agree something is up; I just haven't figured out what it could be. And the 20.83-35.12 t/s is a range of averages, which means my benchmarking isn't stable. I'm aware it's an issue; I just haven't found the cause yet.

@USBhost commented Apr 26, 2024

python koboldcpp.py --threads 24 --contextsize 65536 --model ../Mixtral-8x22B-Instruct-v0.1/ggml-model-Q8_0.gguf
I'm using koboldcpp with this PR on an AMD EPYC 7F72 server with nothing else running. Inference is actually 'slower', but by so little... This PR may be better suited to a system that isn't idle?

With PR:
Processing Prompt (158 / 158 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:12.77s (80.8ms/T = 12.37T/s), Generate:39.64s (330.3ms/T = 3.03T/s), Total:52.41s (2.29T/s)
Processing Prompt (135 / 135 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:10.96s (81.2ms/T = 12.32T/s), Generate:39.64s (330.3ms/T = 3.03T/s), Total:50.59s (2.37T/s)
Processing Prompt (132 / 132 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:10.73s (81.3ms/T = 12.30T/s), Generate:39.67s (330.6ms/T = 3.02T/s), Total:50.40s (2.38T/s)

Without:
Processing Prompt (158 / 158 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:12.73s (80.6ms/T = 12.41T/s), Generate:38.92s (324.3ms/T = 3.08T/s), Total:51.65s (2.32T/s)
Processing Prompt (135 / 135 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:10.91s (80.8ms/T = 12.38T/s), Generate:38.92s (324.4ms/T = 3.08T/s), Total:49.83s (2.41T/s)
Processing Prompt (135 / 135 tokens)
Generating (120 / 120 tokens)
CtxLimit: 278/65536, Process:10.92s (80.9ms/T = 12.36T/s), Generate:38.89s (324.1ms/T = 3.09T/s), Total:49.81s (2.41T/s)

@kunnis (Contributor, Author) commented Apr 26, 2024

Just a note: for my testing, I've disabled all BLAS backends, including llamafile.

@kunnis (Contributor, Author) commented Apr 27, 2024

It was all Windows scheduling. For testing, I threw in some basic affinity code (the n*2 is because I have hyper-threading enabled and I want each thread on a different physical core):

#include <windows.h> // for GetCurrentThread, SetThreadPriority, SetThreadAffinityMask

static void set_numa_thread_affinity(int thread_n) {
    HANDLE handle = GetCurrentThread();
    // raise priority so other processes are less likely to preempt the worker
    SetThreadPriority(handle, THREAD_PRIORITY_ABOVE_NORMAL);
    // pin to every other logical processor, i.e. one thread per physical core with HT on
    SetThreadAffinityMask(handle, 1 << (thread_n * 2));
}

And I'm getting tight timing results on Windows now, with a 4% margin of error on PP and sub-1% on TG. I also added timing to ggml_compute_forward_mul_mat_one_chunk. That means scheduling stalls are the issue.

@slaren If I write something like my code, but without the janky "just divide by 16" approach, where it properly partitions the work like #1507 does, would that probably get merged? (Also, thanks for the link to #1507.) If so, I'll cook it up over the weekend.

@kunnis (Contributor, Author) commented Apr 27, 2024

I couldn't sleep, so I coded the idea up.

@LostRuins (Collaborator) commented:
One confounding factor to account for is the behavior of E-cores vs P-cores. On newer systems a thread running on an E-core can be significantly slower than one running on a P-core, causing a potential bottleneck not present in older systems or systems with different scheduling behavior.

Thread scheduling being OS-dependent and unpredictable may make comparisons tricky; it might be worth disabling the efficiency cores first to compare the improvement.

Maybe Related:
LostRuins#447

@kunnis (Contributor, Author) commented Apr 27, 2024

I'm just running a Ryzen 7600, so I don't have the E-core/P-core situation, but this should address that issue as a side effect. The updated PR divides most matrix multiplications of a 7B model into 10,000+ chunks, and threads that were previously finishing up to 4000 us apart (even with the prioritization I hacked in for testing) now finish within about 20 us of each other. And this should be a pretty clean, cross-platform solution: instead of assuming all threads get the same amount of execution time, just divide the larger matmul into small blocks and execute the smaller blocks.

@kunnis (Contributor, Author) commented Apr 27, 2024

I removed the global, moving it to shared state, but that required moving a bit of code around. Next I'll go through and fix the build errors; they're just unused-variable warnings from some configs that I need to address.

Maybe we should move the thread configuration (atomic_store, ggml_thread_create, set_numa_thread_affinity, etc.) to its own file?

I also don't like that I had to pass the ggml_compute_state through ggml_compute_forward, but I don't see a better way of doing it. It kind of seems right to have it there, because everything needs arguments like ith and nth that originate in the state. A rough sketch of the resulting shape is below.
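For readers following along, here is a rough sketch of the shape this gives. The field and function names are hypothetical and heavily trimmed; the real ggml_compute_state and ggml_compute_state_shared structs carry more members than shown here:

#include <stdatomic.h>

// Shared between all worker threads of one graph computation.
struct compute_state_shared {
    atomic_int current_chunk; // next matmul chunk to hand out
    int        n_threads;
    // ... barrier/sync fields, graph pointer, work buffer, etc. ...
};

// Per-thread state; each op now receives this instead of bare ith/nth values.
struct compute_state {
    int                           ith;    // index of this thread
    struct compute_state_shared * shared; // where the atomic counter lives
};

// An op implementation can pull ith/nth out of the state and still reach the
// shared chunk counter when it needs dynamic scheduling.
static void forward_mul_mat(struct compute_state * state) {
    const int ith = state->ith;
    const int nth = state->shared->n_threads;
    (void) ith; (void) nth;
    // chunk loop: atomic_fetch_add(&state->shared->current_chunk, 1) ...
}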

@kunnis force-pushed the MMThreadingPerfChange branch 2 times, most recently from 6eb46e2 to 54c2460 on April 28, 2024 04:44
@kunnis (Contributor, Author) commented Apr 28, 2024

@slaren If you have a chance, could you review this PR? I think it might be ready. It produces stable timing results along with a mild speed gain (see after 2.txt), and the worst-case outliers have almost all gone away. You might want to look at my change to ggml_graph_compute_thread_sync_node; I found it added to the speed boost as well, but maybe that's too much for this PR. I had to move several methods around in the file so I could pass the ggml_compute_state around. I think that's reasonable, because we already end up passing around several of the values from ggml_compute_state... we might as well just pass the state itself. But maybe that's the wrong call? Do you mind the slightly longer variable names? The short names like ir110 start getting confusing to me.

Since GitHub isn't showing the diff very well: the only thing changed in the moved code is adding current_chunk to ggml_compute_state_shared. I realized I could have avoided most of the moves by declaring the functions ahead of time; hindsight is 20/20. If you'd prefer that, let me know and I'll figure out how to return them to their original locations.

I also thought about why we may be seeing a difference in speed reproducibility. I'm on Windows, so my set_numa_thread_affinity is an empty function. I did some test runs where I did a hacky job of setting thread affinity and thread priority, and that helped stabilize the timings as well. Do you think that might explain why @p-e-w and I had different experiences?

@Jeximo (Contributor) commented Apr 28, 2024

Now the PR seems equal to master for my device (fresh clone & patch):

llama_print_timings:        load time =    3224.59 ms
llama_print_timings:      sample time =      38.06 ms /   217 runs   (    0.18 ms per token,  5700.92 tokens per second)
llama_print_timings: prompt eval time =   40340.46 ms /    60 tokens (  672.34 ms per token,     1.49 tokens per second)
llama_print_timings:        eval time =   71808.87 ms /   216 runs   (  332.45 ms per token,     3.01 tokens per second)
llama_print_timings:       total time =  114604.73 ms /   276 tokens

Just a note... for my testing, I've disabled all BLASs, including Llamafile.

No BLAS, and disabled llamafile:
system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LAMMAFILE = 0 |

The thread scheduling being OS dependent and unpredictable may make comparisons tricky

Yes, likely my result is not easily reproducible, as I cannot control thread affinity on Android.

@kunnis (Contributor, Author) commented Apr 28, 2024

@Jeximo You may want to try doubling the number of threads and see if that helps. The PR should handle poor OS scheduling better than master. In theory you should get the best performance when the number of threads equals the number of logical processors, but neither master nor this PR hits that on Windows. I'm still getting the best performance when the number of threads equals the number of physical cores, but maybe you'll get something different.

@kunnis kunnis marked this pull request as ready for review April 28, 2024 23:24
@Jeximo (Contributor) commented Apr 29, 2024

The PR should handle poor OS scheduling better than master.

Ah, yeah, master llama.cpp with -t 8 normally kills my OS; the PR actually allows tokenization (not faster, but functional).

@slaren (Member) commented Apr 29, 2024

Since github isn't showing the diff very well, the only things changed in the moved code is adding current_chunk to ggml_compute_state_shared I realized I could have avoided most of the moves by declaring functions ahead of time, hindsight is 20/20. If you'd prefer that, LMK, and I'll figure out how to return them to their original locations.

It would be very useful if the order is restored so that the diff only shows the changes, it's hard to review otherwise.

@cpumaxx (Contributor) commented Apr 29, 2024

Hello, here's a bit of feedback from a larger machine (two sockets, 64 cores).
This branch gives me about a 20% speedup in the naive case, but I see a 15% performance regression with NUMA flags vs. master. Since NUMA flags give me a 4x speedup over not using them, this regression is very significant for my specific case.

@kunnis (Contributor, Author) commented Apr 29, 2024

It would be very useful if the order is restored so that the diff only shows the changes, it's hard to review otherwise.

Done

Hello, here's a bit of feedback from a larger machine (two socket 64 cores) This branch gives me about a 20% speedup in the naive case, but I see a 15% performance regression with numa flags vs master. Since numa flags give me a 4x speedup vs. not, this speed reduction is very significant for my specific case.

I'll ponder that case. Is the speed drop in PP or TG? How did you measure the change? Can you post llama-bench results? What size model? If you want to really dig into it, compile with GGML_PERF, find the section of the code that says "These numbers are useful when trying to measure how well the threading scheduling works.", uncomment it, run llama-bench, and attach about 2000 lines of output (it'll be super chatty on your system). If possible, grab logs while it's doing PP and again while it's doing TG; that will let me see whether it's just a task-size problem.

@slaren (Member) commented Apr 29, 2024

What NUMA flags are you using? For NUMA you probably want the same processors to always access the same fraction of the weights, so that each fraction can be moved to the local memory of its node.

@kunnis (Contributor, Author) commented Apr 29, 2024

@cpumaxx Also, would you mind testing this branch to see whether it's the same, better, or worse than master with NUMA on? That will tell me whether the locking implementation or the different scheduling is the problem.

@slaren (Member) commented Apr 30, 2024

13900k under WSL2:

| model | size | backend | threads | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 8 | pp 128 | 47.47 ± 0.84 | 56.79 ± 0.87 | 1.20 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 8 | tg 32 | 16.01 ± 0.22 | 18.00 ± 0.10 | 1.12 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 16 | pp 128 | 42.67 ± 0.27 | 72.97 ± 1.01 | 1.71 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 16 | tg 32 | 14.45 ± 0.05 | 18.83 ± 0.03 | 1.30 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 24 | pp 128 | 60.48 ± 0.37 | 85.12 ± 0.89 | 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 24 | tg 32 | 16.17 ± 0.04 | 18.72 ± 0.04 | 1.16 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 32 | pp 128 | 68.64 ± 0.58 | 84.54 ± 2.63 | 1.23 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 32 | tg 32 | 14.20 ± 0.15 | 15.52 ± 0.08 | 1.09 |

So it's a very nice speedup for a CPU with E-cores. I didn't expect that it would also improve generation performance with 8 threads (= the number of P-cores); I guess the OS scheduler is not doing a very good job here. Perplexity also looks good. I will try under native Windows as well later.

I think it is important to keep the option of using the previous fixed scheduling, though. I expect it will be especially important for NUMA systems, and it may also help systems with only one type of CPU core.

@kunnis (Contributor, Author) commented Apr 30, 2024

@slaren Yeah, I'm doing all my testing under Windows, and the speedup is similar there. I think even with a CPU whose cores are all the same type, it's an across-the-board speedup. OSes don't guarantee perfect scheduling, so a blockwise approach like this PR's should help in almost any case; even on a dedicated machine the OS still handles interrupts on some processor, so you can't assume you get 100% of every core. I think the only case where this won't help is a single-threaded environment. Plugging in some reasonable worst-case numbers, this is one atomic operation per 16x1 block (the size of one block of a token-generation vec * matrix) times 4096 (the row length of many of the matrices in a 7B-parameter model), i.e. roughly one atomic per 64k (16 x 4096 = 65,536) FP operations. That's pretty low, and the metrics I have don't show the atomic as a notable part of the overhead.

I think there's still an issue with the PR regarding cpumaxx's problem, but I may have a solution; I want to see what he reports back to my questions first. I'd like to get rid of the fixed scheduler entirely; I think the dynamic scheduler should be able to do all the work and be speed-neutral or better even with NUMA enabled. I'm wondering whether the slowdowns he's seeing aren't scheduler-related at all, but instead come from the changes I made to ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task. If so, there's another solution that I think will give us the best of both worlds. If you can replicate his issue and know a way to test it, I'll write up the alternate version.

@slaren (Member) commented Apr 30, 2024

13900k with MSVC

| model | size | backend | threads | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 8 | pp 128 | 42.76 ± 0.46 | 49.61 ± 0.11 | 1.16 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 8 | tg 32 | 15.59 ± 0.23 | 17.23 ± 0.07 | 1.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 16 | pp 128 | 37.72 ± 0.03 | 63.41 ± 0.82 | 1.68 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 16 | tg 32 | 11.91 ± 0.04 | 16.37 ± 0.01 | 1.37 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 24 | pp 128 | 52.54 ± 0.17 | 72.59 ± 0.29 | 1.38 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 24 | tg 32 | 12.51 ± 0.02 | 14.99 ± 0.06 | 1.20 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 32 | pp 128 | 33.28 ± 0.56 | 67.71 ± 1.78 | 2.03 |
| llama 7B Q4_K - Medium | 3.80 GiB | CPU | 32 | tg 32 | 4.03 ± 0.11 | 6.50 ± 0.34 | 1.61 |

Similar speedup, but worse absolute performance.

@kunnis (Contributor, Author) commented Apr 30, 2024

@slaren Would you mind trying https://github.com/kunnis/llama.cpp/tree/MMThreadingPerfChangeWithMmPause as an alternative with your setup? It's a lighter-weight change to ggml_graph_compute_thread_sync_node and ggml_graph_compute_thread_sync_task. It's not as fast for me as the PR version, but it's close; maybe it's a better compromise between all the use cases.
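As the branch name suggests, that variant adds calls to _mm_pause. For context, here is a generic sketch of a pause-based spin-wait (an illustration of the pattern, not the actual ggml_graph_compute_thread_sync_* code from either branch):

#include <stdatomic.h>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_relax() _mm_pause() // tells the CPU we are spin-waiting
#else
#define cpu_relax() ((void) 0)  // no-op on other architectures
#endif

// Block until the shared phase counter reaches the expected value.
static void wait_for_phase(const atomic_int * phase, int expected) {
    while (atomic_load_explicit(phase, memory_order_acquire) != expected) {
        cpu_relax(); // cheaper than yielding to the OS, keeps the thread hot
    }
}

The pause hint reduces power draw and frees pipeline resources for a sibling hyper-thread while spinning, at the cost of a slightly slower wake-up than a pure spin, which may explain the "close but not quite as fast" result above.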

@ggerganov (Member) commented:
@calculatortamer Slightly off-topic - does enabling flash-attention (-fa 1) improve the performance on any of these models on your orange pi?

@kunnis (Contributor, Author) commented May 14, 2024

@slaren Yeah. Is there a formatter I need to run or something?

@slaren (Member) commented May 14, 2024

The formatting needs to be done manually. Automatic formatters usually remove the vertical alignment, so they are not used.

@calculatortamer commented May 14, 2024

@calculatortamer Slightly off-topic - does enabling flash-attention (-fa 1) improve the performance on any of these models on your orange pi?

@ggerganov I did some speed tests:

./llama-bench -fa 0,1 -t 4,8 -r 2 -m Meta-Llama-3-8B-Instruct-IQ4_XS.gguf

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 4 | 0 | pp 512 | 5.86 ± 0.00 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 4 | 0 | tg 128 | 4.41 ± 0.03 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 8 | 0 | pp 512 | 6.22 ± 0.00 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 8 | 0 | tg 128 | 4.55 ± 0.10 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 4 | 1 | pp 512 | 5.64 ± 0.00 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 4 | 1 | tg 128 | 4.46 ± 0.01 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 8 | 1 | pp 512 | 5.83 ± 0.00 |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | CPU | 8 | 1 | tg 128 | 4.52 ± 0.06 |

build: bd80601 (2844)

conclusion: TG same within margin of error, smaller PP (~ -5%)

./llama-bench -fa 0,1 -t 4,8 -r 5 -m ../llama.cpp/models/TinyLlama-1.1B-intermediate-step-1431k-3T.IQ4_XS.gguf,../llama.cpp/models/tinyllama-1.1b-intermediate-step-1431k-3t.Q4_K_S.gguf

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 4 | 0 | pp 512 | 37.10 ± 0.07 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 4 | 0 | tg 128 | 20.58 ± 0.27 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 8 | 0 | pp 512 | 34.38 ± 0.01 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 8 | 0 | tg 128 | 26.16 ± 0.45 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 4 | 1 | pp 512 | 35.34 ± 0.01 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 4 | 1 | tg 128 | 20.13 ± 0.21 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 8 | 1 | pp 512 | 32.37 ± 0.02 |
| llama 1B IQ4_XS - 4.25 bpw | 580.86 MiB | 1.10 B | CPU | 8 | 1 | tg 128 | 25.54 ± 0.26 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 4 | 0 | pp 512 | 31.31 ± 0.08 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 4 | 0 | tg 128 | 18.92 ± 0.18 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 8 | 0 | pp 512 | 29.83 ± 1.53 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 8 | 0 | tg 128 | 24.31 ± 0.22 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 4 | 1 | pp 512 | 30.18 ± 0.02 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 4 | 1 | tg 128 | 18.66 ± 0.12 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 8 | 1 | pp 512 | 28.96 ± 0.02 |
| llama 1B Q4_K - Small | 612.28 MiB | 1.10 B | CPU | 8 | 1 | tg 128 | 23.11 ± 0.64 |

conclusion: smaller TG (equal at best), smaller PP

./llama-bench -fa 0,1 -t 4,8 -r 3 -m ../llama.cpp/models/Phi-3-mini-4k-instruct.IQ4_XS.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_K_S.gguf,../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 4 | 0 | pp 512 | 10.63 ± 0.03 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 4 | 0 | tg 128 | 7.87 ± 0.07 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 0 | pp 512 | 10.28 ± 0.28 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 0 | tg 128 | 8.14 ± 0.07 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 4 | 1 | pp 512 | 9.22 ± 0.15 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 4 | 1 | tg 128 | 7.99 ± 0.07 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 1 | pp 512 | 8.78 ± 0.01 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 1 | tg 128 | 8.31 ± 0.04 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 4 | 0 | pp 512 | 8.88 ± 0.00 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 4 | 0 | tg 128 | 7.01 ± 0.02 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 8 | 0 | pp 512 | 9.16 ± 0.02 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 8 | 0 | tg 128 | 7.84 ± 0.05 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 4 | 1 | pp 512 | 8.52 ± 0.00 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 4 | 1 | tg 128 | 7.04 ± 0.03 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 8 | 1 | pp 512 | 8.59 ± 0.00 |
| phi3 3B Q4_K - Small | 2.04 GiB | 3.82 B | CPU | 8 | 1 | tg 128 | 7.77 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 4 | 0 | pp 512 | 10.59 ± 0.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 4 | 0 | tg 128 | 4.89 ± 0.17 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 8 | 0 | pp 512 | 2.70 ± 0.01 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 8 | 0 | tg 128 | 1.34 ± 0.04 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 4 | 1 | pp 512 | 9.87 ± 0.05 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 4 | 1 | tg 128 | 5.04 ± 0.07 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 8 | 1 | pp 512 | 3.55 ± 0.00 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | CPU | 8 | 1 | tg 128 | 1.32 ± 0.04 |

build: bd80601 (2844)

conclusion IQ4_XS: better TG all of a sudden??? at a cost of a smaller PP
conclusion Q4_K_S: smaller TG (equal at best), smaller PP
conclusion Q4_0: smaller TG, smaller PP except at 8 cores which increased with FA

maybe phi-3 mini is the perfect balance for my orange pi?

Rerun of the IQ4_XS phi3 model, but with -r 10:
./llama-bench -fa 0,1 -t 8 -r 10 -m ../llama.cpp/models/Phi-3-mini-4k-instruct.IQ4_XS.gguf

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 0 | pp 512 | 10.50 ± 0.02 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 0 | tg 128 | 8.21 ± 0.11 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 1 | pp 512 | 9.58 ± 0.08 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CPU | 8 | 1 | tg 128 | 8.23 ± 0.10 |

build: bd80601 (2844)

... nevermind

Overall: no performance improvement. FA always made PP smaller and TG smaller or equal at best. The exception was the first run of phi3 3B IQ4_XS, where TG increased slightly (+0.20 t/s @ 8 cores) but PP was reduced (-2 t/s @ 8 cores); the second run showed that TG increase was just within the margin of error, so no improvement, and PP was still significantly reduced.

Also, in phi3 Q4_0 PP increased @ 8 cores, but that's nothing considering the quant's horrible performance compared to 4 cores.

tl;dr: no, it only reduces PP.

I hope you find something interesting.

@kunnis (Contributor, Author) commented May 14, 2024

The formatting needs to be done manually. Automatic formatters usually remove the vertical alignment, so they are not used.

Sure. I see several spots where my IDE changed spacing around asterisks. I was watching for indenting. I'll go through the diff and see if I spot any other whitespace changes, but is there anything else I should watch out for?

@slaren (Member) left a review:

Some examples

@kunnis (Contributor, Author) commented May 14, 2024

Thanks. I'll knock those out.

@kunnis (Contributor, Author) commented May 14, 2024

I think I fixed the style.

I also found a bit of Windows debug code I'd left in and removed it. There's a patch waiting to be merged that does a better job of something similar.

@kunnis (Contributor, Author) commented May 14, 2024

@slaren What are your thoughts on my next PR being about moving everything related to NUMA, thread configuration, barriers, atomics, and pthreads into a file named ggml-threading.h? It should make things a lot simpler. Is there a better place to discuss this?

@slaren (Member) commented May 14, 2024

I think that would be good, but you should check with @ggerganov. I think here is fine, but you can also start an issue or discussion if you want broader feedback.

@slaren (Member) commented May 14, 2024

There are a few warnings when building without llamafile:

ggml.c: In function ‘ggml_compute_forward_mul_mat’:
ggml.c:12040:18: warning: unused variable ‘row_size’ [-Wunused-variable]
12040 |     const size_t row_size = ggml_row_size(vec_dot_type, ne10);
      |                  ^~~~~~~~
ggml.c:12039:18: warning: unused variable ‘wdata’ [-Wunused-variable]
12039 |     const void * wdata    = (src1->type == vec_dot_type) ? src1->data : params->wdata;
      |                  ^~~~~
ggml.c:11906:19: warning: unused variable ‘r3’ [-Wunused-variable]
11906 |     const int64_t r3 = ne13/ne03;
      |                   ^~
ggml.c:11905:19: warning: unused variable ‘r2’ [-Wunused-variable]
11905 |     const int64_t r2 = ne12/ne02;
      |                   ^~
ggml.c:11883:16: warning: unused variable ‘src1_cont’ [-Wunused-variable]
11883 |     const bool src1_cont = ggml_is_contiguous(src1);
      |

@kunnis (Contributor, Author) commented May 14, 2024

I'll update the PR to address those. I thought I'd checked that case.
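For reference, one common way to deal with warnings like those above, when the variables are only consumed in conditionally compiled code paths, is to mark them as intentionally unused. This is a sketch of the general pattern, not necessarily the exact change made in the PR; ggml has its own UNUSED-style macro whose exact definition may differ:

#include <stddef.h>
#include <stdint.h>

#define UNUSED(x) (void)(x) // illustrative; ggml provides an equivalent macro

void mul_mat_non_llamafile(const void * wdata, size_t row_size, int64_t r2, int64_t r3) {
    // These values are only read by the conditionally compiled llamafile path,
    // so mark them as intentionally unused in the other configurations.
    UNUSED(wdata);
    UNUSED(row_size);
    UNUSED(r2);
    UNUSED(r3);
}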

@ggerganov (Member) commented:

@slaren What are your thoughts on my next PR being about moving everything related to numa, thread configuration, barriers, atomics, pthread into a file named ggml-threading.h Should be a lot simpler. Is there a better place to discuss this?

I think that would be good, but you should check with @ggerganov. I think here is fine, but you can also start an issue or discussion if you want more broad feedback.

Yes, it's OK to add ggml-threading.h/.c

@slaren changed the title from "Draft Idea... CPU Inference... This seems to perform better?" to "ggml : use dynamic thread scheduling for matrix multiplication" on May 15, 2024
@slaren slaren requested a review from ggerganov May 15, 2024 16:46
@slaren slaren merged commit e1b40ac into ggml-org:master May 15, 2024
60 of 65 checks passed
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 17, 2024
…org#6915)

* Just reordering some structs.

* Adding in the calls to mm_pause

* Passing around the state

* Renaming and moving a bunch of variables around.

* Extracting the logic to it's own function.

* Moving some variable definitions into the chunk function.

* Moving some variables around

* moving src1_cont inside

* Moving row_size

* adding the current_chunk

* Reorg the code.

* Formatting to match the orig patch

* starting to setup the chunking variables

* Starting the buildup of the loop

* The yield shouldn't be necessary.

* adding the looping structure based on the chunk configuration.

* Add in the re-chunking code.

* Making it much more likely to rechunk.

* disable resizing if numa is enabled.

* Updating comments with what we've learned.

* Fix formatting

* Couple more formatting fixes.

* More style fixes.

* Fix Warnings

* Going with unused because there's conditional logic that needs it.

* Update ggml.c

* Update ggml.c

---------
@kunnis kunnis deleted the MMThreadingPerfChange branch May 20, 2024 19:28
jart added a commit to jart/llama.cpp that referenced this pull request May 22, 2024
@kunnis kunnis mentioned this pull request May 27, 2024
Labels: enhancement (New feature or request); Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs)
Projects: None yet
Issues that may be closed by merging this pull request: None yet