Eval bug: Gemma 3 extremely slow prompt processing when using quantized KV cache #12352
Comments
The same bug also happens with RekaAI/reka-flash-3.
Also encountered the same problem.
Try a build without this option (presumably GGML_CUDA_FORCE_CUBLAS, shown as enabled in the version output below). If you don't remember enabling it, delete the build directory first and reconfigure with CMake.
Unfortunately, that did not change a thing.

./bin/llama-server -m '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf' --n-gpu-layers -1 --cache-type-k q8_0 --cache-type-v q8_0 --batch_size 1024 --flash-attn -c 4000 --port 7777 -t 8 -ngl 99

system_info: n_threads = 8 (n_threads_batch = 8) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 7777, http threads: 23
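As an aside, the slowdown can presumably also be isolated with llama-bench instead of the server; the model path and prompt length below are placeholders, and exact flag spellings may vary between builds:

# baseline: flash attention with the default F16 KV cache
./bin/llama-bench -m models/gemma-3-12b-it-Q5_K_M.gguf -ngl 99 -fa 1 -p 1024 -n 0

# same run with a quantized KV cache
./bin/llama-bench -m models/gemma-3-12b-it-Q5_K_M.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -p 1024 -n 0

Comparing the pp (prompt processing) rows of the two runs should reproduce the gap without any client involved.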
The large number of graph splits indicates that there is some operation that is not supported by the CUDA backend and is being run on the CPU. If you set the environment variable …
This is the log with your env variables and -v:

SPLIT #0: CPU # 0 inputs
node #   0 (  GET_ROWS): inp_embd ( 15K) [ CPU ]: token_embd.weight ( 787M) [ CPU ] inp_tokens ( 0K) [ CPU ]
SPLIT #1: CUDA0 # 3 inputs: [inp_embd ( 15K)] [inp_pos ( 0K)] [KQ_mask_swa ( 64K)]
node #   1 (     SCALE): inp_scaled ( 15K) [CUDA0 ]: CUDA0#inp_embd#0 ( 15K) [ NULL ]
SPLIT #2: CPU # 4 inputs: [q-0 ( 16K)] [k-0 ( 544K)] [v-0 ( 544K)] [KQ_mask_swa (copy) ( 32K)]
node #  23 (FLASH_ATTN): node_23 ( 16K) [ CPU ]: CPU#q-0#0 ( 16K) [ NULL ] CPU#k-0#0 ( 544K) [ NULL ] CPU#v-0#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #3: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #  25 (   MUL_MAT): kqv_out-0 ( 15K) [CUDA0 ]: blk.0.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ]
SPLIT #4: CPU # 3 inputs: [q-1 ( 16K)] [k-1 ( 544K)] [v-1 ( 544K)]
node #  59 (FLASH_ATTN): node_59 ( 16K) [ CPU ]: CPU#q-1#0 ( 16K) [ NULL ] CPU#k-1#0 ( 544K) [ NULL ] CPU#v-1#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #5: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #  61 (   MUL_MAT): kqv_out-1 ( 15K) [CUDA0 ]: blk.1.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ]

[... the same CPU/CUDA0 split pair repeats for every remaining layer (blk.2 through blk.46): each FLASH_ATTN node is placed on the CPU, taking q-N, k-N, v-N and a KQ_mask_swa copy (or, for the global-attention layers, a KQ_mask copy) as inputs, while the corresponding kqv_out-N MUL_MAT against blk.N.attn_output.weight stays on CUDA0 ...]

SPLIT #96: CPU # 3 inputs: [q-47 ( 16K)] [k-47 ( 544K)] [v-47 ( 544K)]
node #1716 (FLASH_ATTN): node_1716 ( 16K) [ CPU ]: CPU#q-47#0 ( 16K) [ NULL ] CPU#k-47#0 ( 544K) [ NULL ] CPU#v-47#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #97: CUDA0 # 2 inputs: [ (reshaped) ( 16K)] [inp_out_ids ( 0K)]
node #1718 (   MUL_MAT): kqv_out-47 ( 15K) [CUDA0 ]: blk.47.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ]

data stream, to_send: data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1741785946,"id":"chatcmpl-9Qb2GeBPXWugyHyvaWal9u90NNwaReFo","model":"gpt-3.5-turbo","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","usage":{"completion_tokens":22,"prompt_tokens":17,"total_tokens":39},"timings":{"prompt_n":13,"prompt_ms":75.857,"prompt_per_token_ms":5.835153846153846,"prompt_per_second":171.37508733538104,"predicted_n":22,"predicted_ms":945.028,"predicted_per_token_ms":42.95581818181818,"predicted_per_second":23.279733510541487}}
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
So it is the flash attention. This is probably because this head size (256) is only supported with F16. I'm not sure whether that is because it is not commonly used, or because there is some performance issue that makes it unusable; @JohannesGaessler should know more. You should still be able to use K quantization with flash attention disabled.
Without the flash attention flag it does not load at all, unfortunately. The command:

./bin/llama-server -m '/home/luis/Downloads/gemma-3-12b-it-Q4_K_M.gguf' --n-gpu-layers -1 --batch_size 1024 --cache-type-k q8_0 --cache-type-v q8_0 -c 8000 --port 7777 -t 8 -ngl 99 -v

The logs: …
Without flash attention you can only quantize K, but not V. You need to remove the --cache-type-v option.
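A sketch of the corrected invocation implied by this advice, reusing the path and sizes from the command above (flash attention off, only the K cache quantized):

./bin/llama-server -m '/home/luis/Downloads/gemma-3-12b-it-Q4_K_M.gguf' -ngl 99 --cache-type-k q8_0 -c 8000 --port 7777 -t 8

Alternatively, keeping --flash-attn and dropping both cache-type flags stays on the default F16 KV cache, trading VRAM for speed.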
I encountered the same problem; it seems the KV cache gets moved off the GPU automatically, even without the -nkvo option.
Models with the same parameter count run significantly faster, and even models with a larger parameter count perform better than Gemma 3. Related issue: GitHub Issue #9701 |
Try …
Changes nothing, same output.
I'm seeing the same on my older P6000.
Same case: while I can run 12B models easily, Gemma 3 12B gets its cache offloaded, and not having the V cache quantized is not an option in low-VRAM situations. If I use --flash-attn -ctk q4_0 -ctv q4_0, the prompt processing is done on the CPU. If I use --flash-attn -ctk q4_0, the prompt processing is offloaded, but the VRAM consumption skyrockets past the GPU's capacity and that is slow as molasses. Changing topic a bit, I am really grateful for all the community efforts on llama.cpp.
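To summarize the combinations reported so far in this thread (as observed on CUDA with Gemma 3's head size of 256; other backends or newer builds may behave differently):

flash attention on,  K and V quantized  ->  loads, but FLASH_ATTN falls back to the CPU; prompt processing is very slow
flash attention on,  only K quantized   ->  attention stays on the GPU, but the F16 V cache costs extra VRAM
flash attention off, K and V quantized  ->  does not load; quantizing V requires flash attention
flash attention off, only K quantized   ->  works; the practical workaround for now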
Inference is twice as slow when using the q8_0 cache: 16 tokens/sec unquantized vs. 8 tokens/sec with the q8_0 KV cache. With Mistral Nemo this is only ~5% slower.
Bump! Very sad bug: you get a fat context, but quantizing it kills inference speed. =(
The problem is register pressure. Head size 256 needs more registers than head size 128, and a quantized KV cache also needs more registers than an FP16 KV cache. If you combine the two, the current kernel simply runs out of registers and the performance is effectively unusable, which is why the CUDA backend does not support it. The code would need to be specifically rewritten for that use case to make it usable.
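A rough back-of-the-envelope illustration of the pressure (illustrative numbers only, not the actual kernel's tiling):

head size 256, one FP16 Q row per thread: 256 values x 2 B = 512 B ≈ 128 32-bit registers
NVIDIA per-thread register limit:         255 registers
-> roughly half the budget is gone before the softmax state, the output accumulator,
   and the per-block scales/scratch needed to dequantize q8_0 K/V on the fly;
   head size 128 with an F16 cache needs only about half as much, which is why that case fits.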
Thanks for your answer!
Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
(newest b4876 version)
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 3900x + rtx 3060 12gb
Models
Gemma-3-12b_Q5_K_M
Problem description & steps to reproduce
Prompt eval time is much slower with a quantized KV cache than with the standard KV cache. I also see the CPU being used when the quantized KV cache is enabled, so I believe the KV cache is not being processed by the GPU when quantization is turned on.
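A minimal reproduction sketch based on the commands in this thread (the model path is a placeholder; adjust -ngl and -c to your hardware):

# slow: quantized KV cache + flash attention
./bin/llama-server -m models/gemma-3-12b-it-Q5_K_M.gguf -ngl 99 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 -c 4000 --port 7777

# fast baseline: default F16 KV cache
./bin/llama-server -m models/gemma-3-12b-it-Q5_K_M.gguf -ngl 99 --flash-attn -c 4000 --port 7777

The first run should show the slow prompt eval and the CPU usage described above; the second is the fast baseline.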
First Bad Commit
No response
Relevant log output