
Run several single thread operators parellel #850


Open
wants to merge 1 commit into master

Conversation

@howard0su (Collaborator) commented Apr 8, 2023

In testing with 18 threads, this shows about a 5% performance gain.

@howard0su (Collaborator Author)

In my testing, this gives a noticeable difference when running with a high number of threads:

Before

Running with 18 threads...
         18 threads | run 1/4 | current token time 237.93 ms - eval time 32116.6 ms - prompt eval time 1903.46 ms
         18 threads | run 2/4 | current token time 223.18 ms - eval time 33081.06 ms - prompt eval time 1785.41 ms
         18 threads | run 3/4 | current token time 560.18 ms - eval time 42127.41 ms - prompt eval time 4481.46 ms
         18 threads | run 4/4 | current token time 226.64 ms - eval time 32145.81 ms - prompt eval time 1813.12 ms

After

Running with 18 threads...
         18 threads | run 1/4 | current token time 221.68 ms - eval time 30907.99 ms - prompt eval time 1773.44 ms
         18 threads | run 2/4 | current token time 225.93 ms - eval time 31100.67 ms - prompt eval time 1807.41 ms
         18 threads | run 3/4 | current token time 222.29 ms - eval time 30184.69 ms - prompt eval time 1778.33 ms
         18 threads | run 4/4 | current token time 233.21 ms - eval time 31018.9 ms - prompt eval time 1865.65 ms

        } else {
            int start = i;
            int end = i + 1;
            while (end < cgraph->n_nodes && dispath_threads < n_threads && (end - start) < 4)
@howard0su (Collaborator Author)

magic number 4 needs some tuning.
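
The quoted diff packs consecutive single-thread nodes into batches of at most 4 and hands each batch to a different worker. Below is a minimal standalone sketch of that batching idea, not the PR's actual code; BATCH_MAX, node_t and dispatch_batch are illustrative names only:

/* Sketch: consecutive graph nodes that each need only one thread are grouped
 * into batches of at most BATCH_MAX and each batch is handed to a different
 * worker, so several single-thread operators can run in parallel. */
#include <stdio.h>

#define BATCH_MAX 4          /* the "magic number 4" from the diff; needs tuning */

typedef struct {
    int n_tasks;             /* threads the node wants; 1 == single-thread op */
} node_t;

/* stand-in for handing a [start, end) range of nodes to one worker thread */
static void dispatch_batch(int worker, int start, int end) {
    printf("worker %d runs nodes [%d, %d)\n", worker, start, end);
}

int main(void) {
    node_t nodes[] = { {1}, {1}, {1}, {1}, {1}, {8}, {1}, {1} };
    const int n_nodes   = (int)(sizeof(nodes) / sizeof(nodes[0]));
    const int n_threads = 4;

    for (int i = 0; i < n_nodes; ) {
        if (nodes[i].n_tasks > 1) {
            /* multi-threaded op: not batched; all threads would cooperate on
             * this one node, so just note it and move on */
            printf("all threads run node %d together\n", i);
            i++;
            continue;
        }
        /* single-thread ops: pack up to BATCH_MAX consecutive ones per worker */
        int worker = 0;
        while (i < n_nodes && nodes[i].n_tasks == 1 && worker < n_threads) {
            int start = i;
            int end   = i + 1;
            while (end < n_nodes && nodes[end].n_tasks == 1 &&
                   (end - start) < BATCH_MAX) {
                end++;
            }
            dispatch_batch(worker++, start, end);
            i = end;
        }
    }
    return 0;
}

Raising or lowering BATCH_MAX trades dispatch overhead against load balance across the workers, which is exactly the tuning question raised above.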

@howard0su changed the title from "Run several single thread operators in the worker threads" to "Run several single thread operators parellel" on Apr 9, 2023
@howard0su (Collaborator Author)

Eval time:

+--------------------------------------------------------------------------+
|+                 +  + +                         x                       x|
|     |__________A____M____|                |_____M_______A_____________|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3       32116.6      33081.06      32145.81     32447.823     548.59349
+   4      30184.69      31100.67       31018.9     30803.062     419.74213
Difference at 95.0% confidence
        -1644.76 +/- 933.691
        -5.06894% +/- 2.87751%
        (Student's t, pooled s = 475.491)
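
The comparison above matches the output format of the BSD ministat tool (Student's t with pooled variance). As a sanity check, the same numbers can be recomputed with a small standalone C program; the eval times are copied from the tables in this thread, and the "before" run with the 42127.41 ms outlier is excluded so that N = 3, matching the table:

/* Recomputes the difference, 95% confidence interval (Student's t, pooled
 * variance) and pooled standard deviation shown above.  The t critical value
 * 2.571 is the two-sided 97.5% quantile for df = n1 + n2 - 2 = 5. */
#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

static double var(const double *x, int n, double m) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return s / (n - 1);   /* sample variance */
}

int main(void) {
    const double before[] = { 32116.6, 33081.06, 32145.81 };
    const double after[]  = { 30907.99, 31100.67, 30184.69, 31018.9 };
    const int n1 = 3, n2 = 4;

    const double m1 = mean(before, n1), m2 = mean(after, n2);
    const double v1 = var(before, n1, m1), v2 = var(after, n2, m2);

    /* pooled standard deviation and 95% confidence interval */
    const double sp   = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2));
    const double t975 = 2.571;
    const double diff = m2 - m1;
    const double ci   = t975 * sp * sqrt(1.0 / n1 + 1.0 / n2);

    printf("difference: %.2f +/- %.2f ms (%.4f%% +/- %.4f%%), pooled s = %.3f\n",
           diff, ci, 100.0 * diff / m1, 100.0 * ci / m1, sp);
    return 0;
}

Compiled with the math library (e.g. cc stats.c -lm), this prints roughly -1644.76 +/- 933.69 ms, i.e. about -5.07% +/- 2.88%, in line with the table.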

@jon-chuang (Contributor)

> In testing with 18 threads, this shows about a 5% performance gain.

18 threads on a machine with how many cores/threads?

@howard0su (Collaborator Author)

> In testing with 18 threads, this shows about a 5% performance gain.
>
> 18 threads on a machine with how many cores/threads?

A 10-core / 20-thread Windows 10 box.

@ggerganov added the "threading" label (Parallel processing and thread management) on Apr 14, 2023
@ivanstepanovftw (Collaborator)

Getting a segfault:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/LLaMA/7B/ggml-model-q4_0.bin --prompt "William Safire will walk us through the nuances of bad" --threads 1 --seed 1 --n_predict 16 --tfs 0.97 --mirostat 2 --mirostat_ent 5
main: build = 513 (11702ed)
main: seed  = 1
llama.cpp: loading model from models/LLaMA/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.970000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 0


 William Safire will walk us through/p/i/llama.cpp/ggml.c:12128:49: runtime error: member access within null pointer of type 'struct ggml_compute_state'
pc_0x508ca7###func_ggml_graph_compute###file_/p/i/llama.cpp/ggml.c###line_12128###obj_(main+0x508ca7)
pc_0x56e9c8###func_llama_eval_internal###file_/p/i/llama.cpp/llama.cpp###line_1272###obj_(main+0x56e9c8)
pc_0x571ff0###func_llama_eval###file_/p/i/llama.cpp/llama.cpp###line_2726###obj_(main+0x571ff0)
pc_0x41a78a###func_main###file_/p/i/llama.cpp/examples/main/main.cpp###line_360###obj_(main+0x41a78a)
pc_0x7ff119a4a50f###func___libc_start_call_main###file_<null>###line_0###obj_(libc.so.6+0x2750f)
pc_0x7ff119a4a5c8###func___libc_start_main@GLIBC_2.2.5###file_<null>###line_0###obj_(libc.so.6+0x275c8)
pc_0x423624###func__start###file_<null>###line_0###obj_(main+0x423624)

AddressSanitizer:DEADLYSIGNAL

Process finished with exit code 1
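
The trace points at a NULL ggml_compute_state being dereferenced inside ggml_graph_compute when the program runs with --threads 1. Purely as an illustration of that failure mode (not the actual code or the fix), a dispatcher that farms single-thread batches out to workers has to fall back to the main thread when no worker state was ever created:

/* Illustrative guard only: with n_threads == 1 no worker states exist, so any
 * path that hands a batch to workers[...] would dereference NULL. */
#include <stddef.h>
#include <stdio.h>

struct worker_state { int id; };   /* stand-in for ggml_compute_state */

static void run_on_main_thread(int start, int end) {
    printf("main thread runs nodes [%d, %d)\n", start, end);
}

static void run_on_worker(struct worker_state *w, int start, int end) {
    printf("worker %d runs nodes [%d, %d)\n", w->id, start, end);
}

static void dispatch(struct worker_state *workers, int n_threads,
                     int start, int end) {
    if (n_threads <= 1 || workers == NULL) {
        run_on_main_thread(start, end);   /* no workers were ever created */
        return;
    }
    run_on_worker(&workers[0], start, end);
}

int main(void) {
    dispatch(NULL, 1, 0, 4);   /* the --threads 1 case from the report */
    return 0;
}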

@slaren (Member) commented May 5, 2023

To support this properly would require deeper changes, at least:

  • Ensuring that the dependencies of each operation are respected, so that no operations are run before their dependencies
  • Ensuring that enough work buffer memory is allocated to run multiple operations concurrently
  • Ensuring that each operation has a different, non-overlapping work buffer (see the sketch after this list)
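
A minimal sketch of the third point, assuming a hypothetical layout in which every concurrently running op gets its own padded slice of one shared scratch buffer; op_t, assign_work_buffers and the sizes are illustrative, not ggml API:

/* Each op records how much scratch it needs; all ops are then given disjoint
 * slices (padded to a multiple of `align` bytes) of a single allocation, so
 * running them concurrently never overlaps scratch memory. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t work_size;   /* bytes of scratch this op needs */
    void  *work_data;   /* slice assigned to this op */
} op_t;

static void *assign_work_buffers(op_t *ops, int n_ops, size_t align) {
    size_t total = 0;
    for (int i = 0; i < n_ops; i++) {
        total += (ops[i].work_size + align - 1) / align * align;
    }
    uint8_t *buf = malloc(total > 0 ? total : 1);
    if (!buf) return NULL;
    size_t offset = 0;
    for (int i = 0; i < n_ops; i++) {
        ops[i].work_data = buf + offset;
        offset += (ops[i].work_size + align - 1) / align * align;
    }
    return buf;  /* caller frees */
}

int main(void) {
    op_t ops[] = { { 1024, NULL }, { 4096, NULL }, { 512, NULL } };
    void *buf = assign_work_buffers(ops, 3, 64);
    /* ops[0..2].work_data now point to disjoint regions; the three ops are
     * safe to run concurrently as far as scratch memory is concerned */
    free(buf);
    return 0;
}

The first two points (dependency ordering and sizing the total allocation) would still have to be handled by whatever code decides which ops are runnable concurrently.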

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023
It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023
* Add low-level batching notebook

* fix: tokenization of special characters: (ggml-org#850)

It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly

* Update CHANGELOG

* Cleanup

* Fix runner label

* Update notebook

* Use llama_decode and batch api

* Support logits_all parameter

---------

Co-authored-by: Antoine Lizee <[email protected]>
Labels: threading (Parallel processing and thread management)
5 participants