Multi GPU --split-mode row speed regression #6476
With a large model, like a 120B Q3_K_M, there is a small increase in token generation speed with row split: 3.5 t/s vs 3 t/s with layer split.
I am observing no performance difference whatsoever on 3x P40.
+1. I have 2x A6000 and an NVLink bridge. I also used to be able to get some improvement in evaluation speed by upping the batch size to 1024 or 2048, but this now actually slightly reduces my tokens/s for both "row" and "layer" modes.
Have you also played with --ubatch-size?
No - I'd never seen that option until now! Just had a look in the code and found it. What is the difference, and does it replace what the old batch size option used to do?
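As I understand it from this thread, -b/--batch-size is the logical batch (how many tokens are submitted per decode call) while -ub/--ubatch-size is the physical micro-batch (how many tokens are actually computed at once). A minimal sketch of setting both explicitly; the devices and model path are placeholders, and the defaults noted in the comments are my assumption, not something stated in this thread:

# -b  / --batch-size  : logical batch size (tokens submitted per decode call)
# -ub / --ubatch-size : physical micro-batch size (tokens actually computed at once)
# Assumed defaults around this time: -b 2048, -ub 512. Placeholder model path.
CUDA_VISIBLE_DEVICES=0,1 ./main -m ./models/model.Q4_K_M.gguf -ngl 99 -sm row \
    -b 2048 -ub 512 -n 128 -p "USER: Tell me a joke ASSISTANT: "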
Thanks! Also found a couple of other threads about this. I'll have a play with it tomorrow and report back.
#6263 also looks important to consider.
Just a quick follow-up to say that after setting the ubatch size, I'm now actually getting a slightly better improvement from using row split.
Great to hear, but from what I understand from @slaren, ubatch-size is now meant to be 256/512 max, not equal to the logical batch size. Please confirm.
There is no problem increasing ubatch-size if it improves performance on some hardware. |
I just set it to 1024 for both, as that is what I used to set the old batch size to.
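For concreteness, a sketch of what "1024 for both" might look like as a command; the flag spellings match current llama.cpp usage, but the devices and model path here are placeholders:

# Hypothetical invocation: raise both the logical batch (-b) and the physical
# micro-batch (-ub) to 1024, using row split across two GPUs.
CUDA_VISIBLE_DEVICES=0,1 ./main -m ./models/model.Q4_K_M.gguf -ngl 99 -sm row \
    -b 1024 -ub 1024 -n 128 -p "USER: Tell me a joke ASSISTANT: "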
I have 2x A6000 and an NVLink bridge. They are in a dual-CPU board, in PCIe 3.0 x16 slots that are not on the same CPU. I think that is probably why I get such a large speed increase from using row split. From memory, compared with a 1-2 month old version of llama.cpp, I think both split modes...
Since b2475, row split and layer split have the same performance. llama-bench is not affected, but main and server have this regression.
Versions tested: b2474, b2475, b2600
Commands used:
HIP_VISIBLE_DEVICES=0,1,2 ./llama-bench -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf
HIP_VISIBLE_DEVICES=0,1,2 ./main -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -t 4 -ngl 99 --seed 1234 -n 128 --ignore-eos -p "USER: Tell me a joke ASSISTANT: " --split-mode row
HIP_VISIBLE_DEVICES=0,1,2 ./server -t 4 -ngl 99 -sm row -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf -c 4096 -ts 8,10,10 -b 512 --port 8080 --host 192.168.0.87
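One way to confirm where the slowdown landed is to build the two adjacent release tags and run the same main command on each. This is only a sketch: it assumes a ROCm/hipBLAS make build (to match the HIP_VISIBLE_DEVICES usage above) and uses hypothetical log file names.

# Build and time the release tags on either side of the suspected change.
# Assumption: ROCm/hipBLAS build; adjust the make flags to your toolchain.
for tag in b2474 b2475; do
    git checkout "$tag" && make clean && make LLAMA_HIPBLAS=1 -j
    HIP_VISIBLE_DEVICES=0,1,2 ./main -m /home/user/text-generation-webui/models/lzlv_70b_fp16_hf.Q4_K_M.gguf \
        -t 4 -ngl 99 --seed 1234 -n 128 --ignore-eos \
        -p "USER: Tell me a joke ASSISTANT: " --split-mode row 2>&1 | tee "run-$tag.log"
done
# Compare the eval-time tokens-per-second figures printed at the end of each run.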