
Run several single thread operators parellel #850


Open
wants to merge 1 commit into master

Conversation

@howard0su (Collaborator) commented Apr 8, 2023

In testing with 18 threads, this shows about a 5% performance gain.

@howard0su (Collaborator Author)

In my testing, this gives a noticeable difference when running with a high number of threads:

Before

Running with 18 threads...
         18 threads | run 1/4 | current token time 237.93 ms - eval time 32116.6 ms - prompt eval time 1903.46 ms
         18 threads | run 2/4 | current token time 223.18 ms - eval time 33081.06 ms - prompt eval time 1785.41 ms
         18 threads | run 3/4 | current token time 560.18 ms - eval time 42127.41 ms - prompt eval time 4481.46 ms
         18 threads | run 4/4 | current token time 226.64 ms - eval time 32145.81 ms - prompt eval time 1813.12 ms

After

Running with 18 threads...
         18 threads | run 1/4 | current token time 221.68 ms - eval time 30907.99 ms - prompt eval time 1773.44 ms
         18 threads | run 2/4 | current token time 225.93 ms - eval time 31100.67 ms - prompt eval time 1807.41 ms
         18 threads | run 3/4 | current token time 222.29 ms - eval time 30184.69 ms - prompt eval time 1778.33 ms
         18 threads | run 4/4 | current token time 233.21 ms - eval time 31018.9 ms - prompt eval time 1865.65 ms

        } else {
            int start = i;
            int end = i + 1;
            while (end < cgraph->n_nodes && dispath_threads < n_threads && (end - start) < 4)
@howard0su (Collaborator Author)

magic number 4 needs some tuning.
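
The quoted diff packs consecutive single-thread nodes into batches of at most 4 and hands each batch to a different worker. Below is a minimal standalone sketch of that batching idea, not the PR's actual code; BATCH_MAX, node_t and dispatch_batch are illustrative names only:

/* Sketch: consecutive graph nodes that each need only one thread are grouped
 * into batches of at most BATCH_MAX and each batch is handed to a different
 * worker, so several single-thread operators can run in parallel. */
#include <stdio.h>

#define BATCH_MAX 4          /* the "magic number 4" from the diff; needs tuning */

typedef struct {
    int n_tasks;             /* threads the node wants; 1 == single-thread op */
} node_t;

/* stand-in for handing a [start, end) range of nodes to one worker thread */
static void dispatch_batch(int worker, int start, int end) {
    printf("worker %d runs nodes [%d, %d)\n", worker, start, end);
}

int main(void) {
    node_t nodes[] = { {1}, {1}, {1}, {1}, {1}, {8}, {1}, {1} };
    const int n_nodes   = (int)(sizeof(nodes) / sizeof(nodes[0]));
    const int n_threads = 4;

    for (int i = 0; i < n_nodes; ) {
        if (nodes[i].n_tasks > 1) {
            /* multi-threaded op: not batched; all threads would cooperate on
             * this one node, so just note it and move on */
            printf("all threads run node %d together\n", i);
            i++;
            continue;
        }
        /* single-thread ops: pack up to BATCH_MAX consecutive ones per worker */
        int worker = 0;
        while (i < n_nodes && nodes[i].n_tasks == 1 && worker < n_threads) {
            int start = i;
            int end   = i + 1;
            while (end < n_nodes && nodes[end].n_tasks == 1 &&
                   (end - start) < BATCH_MAX) {
                end++;
            }
            dispatch_batch(worker++, start, end);
            i = end;
        }
    }
    return 0;
}

Raising or lowering BATCH_MAX trades dispatch overhead against load balance across the workers, which is exactly the tuning question raised above.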

@howard0su changed the title from "Run several single thread operators in the worker threads" to "Run several single thread operators parellel" on Apr 9, 2023
@howard0su (Collaborator Author)

Eval time:

+--------------------------------------------------------------------------+
|+                 +  + +                         x                       x|
|     |__________A____M____|                |_____M_______A_____________|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3       32116.6      33081.06      32145.81     32447.823     548.59349
+   4      30184.69      31100.67       31018.9     30803.062     419.74213
Difference at 95.0% confidence
        -1644.76 +/- 933.691
        -5.06894% +/- 2.87751%
        (Student's t, pooled s = 475.491)
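
The comparison above matches the output format of the BSD ministat tool (Student's t with pooled variance). As a sanity check, the same numbers can be recomputed with a small standalone C program; the eval times are copied from the tables in this thread, and the "before" run with the 42127.41 ms outlier is excluded so that N = 3, matching the table:

/* Recomputes the difference, 95% confidence interval (Student's t, pooled
 * variance) and pooled standard deviation shown above.  The t critical value
 * 2.571 is the two-sided 97.5% quantile for df = n1 + n2 - 2 = 5. */
#include <math.h>
#include <stdio.h>

static double mean(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}

static double var(const double *x, int n, double m) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
    return s / (n - 1);   /* sample variance */
}

int main(void) {
    const double before[] = { 32116.6, 33081.06, 32145.81 };
    const double after[]  = { 30907.99, 31100.67, 30184.69, 31018.9 };
    const int n1 = 3, n2 = 4;

    const double m1 = mean(before, n1), m2 = mean(after, n2);
    const double v1 = var(before, n1, m1), v2 = var(after, n2, m2);

    /* pooled standard deviation and 95% confidence interval */
    const double sp   = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2));
    const double t975 = 2.571;
    const double diff = m2 - m1;
    const double ci   = t975 * sp * sqrt(1.0 / n1 + 1.0 / n2);

    printf("difference: %.2f +/- %.2f ms (%.4f%% +/- %.4f%%), pooled s = %.3f\n",
           diff, ci, 100.0 * diff / m1, 100.0 * ci / m1, sp);
    return 0;
}

Compiled with the math library (e.g. cc stats.c -lm), this prints roughly -1644.76 +/- 933.69 ms, i.e. about -5.07% +/- 2.88%, in line with the table.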

@jon-chuang (Contributor)

> In testing with 18 threads, this shows about a 5% performance gain.

18 threads on a machine with how many cores/threads?

@howard0su (Collaborator Author)

> In testing with 18 threads, this shows about a 5% performance gain.
>
> 18 threads on a machine with how many cores/threads?

A 10-core / 20-thread Windows 10 box.

@ggerganov added the "threading" label (Parallel processing and thread management) on Apr 14, 2023
@ivanstepanovftw (Collaborator)

Getting a segfault:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/LLaMA/7B/ggml-model-q4_0.bin --prompt "William Safire will walk us through the nuances of bad" --threads 1 --seed 1 --n_predict 16 --tfs 0.97 --mirostat 2 --mirostat_ent 5
main: build = 513 (11702ed)
main: seed  = 1
llama.cpp: loading model from models/LLaMA/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.970000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 0


 William Safire will walk us through/p/i/llama.cpp/ggml.c:12128:49: runtime error: member access within null pointer of type 'struct ggml_compute_state'
pc_0x508ca7###func_ggml_graph_compute###file_/p/i/llama.cpp/ggml.c###line_12128###obj_(main+0x508ca7)
pc_0x56e9c8###func_llama_eval_internal###file_/p/i/llama.cpp/llama.cpp###line_1272###obj_(main+0x56e9c8)
pc_0x571ff0###func_llama_eval###file_/p/i/llama.cpp/llama.cpp###line_2726###obj_(main+0x571ff0)
pc_0x41a78a###func_main###file_/p/i/llama.cpp/examples/main/main.cpp###line_360###obj_(main+0x41a78a)
pc_0x7ff119a4a50f###func___libc_start_call_main###file_<null>###line_0###obj_(libc.so.6+0x2750f)
pc_0x7ff119a4a5c8###func___libc_start_main@GLIBC_2.2.5###file_<null>###line_0###obj_(libc.so.6+0x275c8)
pc_0x423624###func__start###file_<null>###line_0###obj_(main+0x423624)

AddressSanitizer:DEADLYSIGNAL

Process finished with exit code 1
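
The trace points at a NULL ggml_compute_state being dereferenced inside ggml_graph_compute when the program runs with --threads 1. Purely as an illustration of that failure mode (not the actual code or the fix), a dispatcher that farms single-thread batches out to workers has to fall back to the main thread when no worker state was ever created:

/* Illustrative guard only: with n_threads == 1 no worker states exist, so any
 * path that hands a batch to workers[...] would dereference NULL. */
#include <stddef.h>
#include <stdio.h>

struct worker_state { int id; };   /* stand-in for ggml_compute_state */

static void run_on_main_thread(int start, int end) {
    printf("main thread runs nodes [%d, %d)\n", start, end);
}

static void run_on_worker(struct worker_state *w, int start, int end) {
    printf("worker %d runs nodes [%d, %d)\n", w->id, start, end);
}

static void dispatch(struct worker_state *workers, int n_threads,
                     int start, int end) {
    if (n_threads <= 1 || workers == NULL) {
        run_on_main_thread(start, end);   /* no workers were ever created */
        return;
    }
    run_on_worker(&workers[0], start, end);
}

int main(void) {
    dispatch(NULL, 1, 0, 4);   /* the --threads 1 case from the report */
    return 0;
}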

@slaren (Member) commented May 5, 2023

To support this properly would require deeper changes, at least:

  • Ensuring that the dependencies of each operation are respected, so that no operations are run before their dependencies
  • Ensuring that enough work buffer memory is allocated to run multiple operations concurrently
  • Ensuring that each operation has a different, non-overlapping work buffer (see the sketch after this list)
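
A minimal sketch of the third point, assuming a hypothetical layout in which every concurrently running op gets its own padded slice of one shared scratch buffer; op_t, assign_work_buffers and the sizes are illustrative, not ggml API:

/* Each op records how much scratch it needs; all ops are then given disjoint
 * slices (padded to a multiple of `align` bytes) of a single allocation, so
 * running them concurrently never overlaps scratch memory. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t work_size;   /* bytes of scratch this op needs */
    void  *work_data;   /* slice assigned to this op */
} op_t;

static void *assign_work_buffers(op_t *ops, int n_ops, size_t align) {
    size_t total = 0;
    for (int i = 0; i < n_ops; i++) {
        total += (ops[i].work_size + align - 1) / align * align;
    }
    uint8_t *buf = malloc(total > 0 ? total : 1);
    if (!buf) return NULL;
    size_t offset = 0;
    for (int i = 0; i < n_ops; i++) {
        ops[i].work_data = buf + offset;
        offset += (ops[i].work_size + align - 1) / align * align;
    }
    return buf;  /* caller frees */
}

int main(void) {
    op_t ops[] = { { 1024, NULL }, { 4096, NULL }, { 512, NULL } };
    void *buf = assign_work_buffers(ops, 3, 64);
    /* ops[0..2].work_data now point to disjoint regions; the three ops are
     * safe to run concurrently as far as scratch memory is concerned */
    free(buf);
    return 0;
}

The first two points (dependency ordering and sizing the total allocation) would still have to be handled by whatever code decides which ops are runnable concurrently.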

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023
It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023
* Add low-level batching notebook

* fix: tokenization of special characters: (ggml-org#850)

It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly

* Update CHANGELOG

* Cleanup

* Fix runner label

* Update notebook

* Use llama_decode and batch api

* Support logits_all parameter

---------

Co-authored-by: Antoine Lizee <[email protected]>
Labels: threading (Parallel processing and thread management)
5 participants