| 1 | +## Overview |
| 2 | + |
| 3 | +The `rpc-server` allows running a `ggml` backend on a remote host. |
| 4 | +The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them. |
| 5 | +This can be used for distributed LLM inference with `llama.cpp` in the following way: |
| 6 | + |
| 7 | +```mermaid |
| 8 | +flowchart TD |
| 9 | + rpcb---|TCP|srva |
| 10 | + rpcb---|TCP|srvb |
| 11 | + rpcb-.-|TCP|srvn |
| 12 | + subgraph hostn[Host N] |
| 13 | + srvn[rpc-server]-.-backend3["Backend (CUDA,Metal,etc.)"] |
| 14 | + end |
| 15 | + subgraph hostb[Host B] |
| 16 | + srvb[rpc-server]---backend2["Backend (CUDA,Metal,etc.)"] |
| 17 | + end |
| 18 | + subgraph hosta[Host A] |
| 19 | + srva[rpc-server]---backend["Backend (CUDA,Metal,etc.)"] |
| 20 | + end |
| 21 | + subgraph host[Main Host] |
| 22 | + ggml[llama.cpp]---rpcb[RPC backend] |
| 23 | + end |
| 24 | + style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5 |
| 25 | +``` |
| 26 | + |
| 27 | +Each host can run a different backend, e.g. one with CUDA and another with Metal. |
| 28 | +You can also run multiple `rpc-server` instances on the same host, each with a different backend. |
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +On each host, build the corresponding backend with `cmake` and add `-DLLAMA_RPC=ON` to the build options. |
| 33 | +For example, to build the CUDA backend with RPC support: |
| 34 | + |
| 35 | +```bash |
| 36 | +mkdir build-rpc-cuda |
| 37 | +cd build-rpc-cuda |
| 38 | +cmake .. -DLLAMA_CUDA=ON -DLLAMA_RPC=ON |
| 39 | +cmake --build . --config Release |
| 40 | +``` |
| 41 | + |
| 42 | +Then, start the `rpc-server` with the backend: |
| 43 | + |
| 44 | +```bash |
| 45 | +$ bin/rpc-server 0.0.0.0 50052 |
| 46 | +create_backend: using CUDA backend |
| 47 | +ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no |
| 48 | +ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes |
| 49 | +ggml_cuda_init: found 1 CUDA devices: |
| 50 | + Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes |
| 51 | +Starting RPC server on 0.0.0.0:50052 |
| 52 | +``` |
| 53 | + |
| 54 | +When using the CUDA backend, you can specify the device with the `CUDA_VISIBLE_DEVICES` environment variable, e.g.: |
| 55 | +```bash |
| 56 | +$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server 0.0.0.0 50052 |
| 57 | +``` |
| 58 | +This way you can run multiple `rpc-server` instances on the same host, each with a different CUDA device. |
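The multi-instance setup described above could be scripted roughly as follows. This is only a dry-run sketch that prints the launch commands rather than starting anything. The helper name `rpc_server_cmds` and the port scheme (base port plus device index) are assumptions made for illustration and are not part of `llama.cpp`.

```bash
#!/usr/bin/env bash
# Dry-run sketch: print one rpc-server launch command per CUDA device.
# The port scheme (base port + device index) is an assumption for illustration.
rpc_server_cmds() {
    local num_devices="$1"   # number of CUDA devices on this host
    local base_port="$2"     # port for device 0; device i gets base_port + i
    local i
    for ((i = 0; i < num_devices; i++)); do
        echo "CUDA_VISIBLE_DEVICES=${i} bin/rpc-server 0.0.0.0 $((base_port + i))"
    done
}

# Print the commands for a host with two GPUs, starting at port 50052.
rpc_server_cmds 2 50052
```

Each printed line can be run in its own terminal or service unit. Every instance needs a distinct port so the main host can address it separately.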
| 59 | + |
| 60 | + |
| 61 | +On the main host, build `llama.cpp` with only `-DLLAMA_RPC=ON`: |
| 62 | + |
| 63 | +```bash |
| 64 | +mkdir build-rpc |
| 65 | +cd build-rpc |
| 66 | +cmake .. -DLLAMA_RPC=ON |
| 67 | +cmake --build . --config Release |
| 68 | +``` |
| 69 | + |
| 70 | +Finally, use the `--rpc` option to specify the host and port of each `rpc-server`: |
| 71 | + |
| 72 | +```bash |
| 73 | +$ bin/main -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99 |
| 74 | +``` |
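When the list of servers grows, the comma-separated value passed to `--rpc` can be assembled in the shell instead of typed by hand. The sketch below is one possible way to do that. The helper name `rpc_endpoints` is hypothetical, and the addresses are the ones from the example above.

```bash
#!/usr/bin/env bash
# Sketch: join host:port endpoints into the comma-separated value for --rpc.
rpc_endpoints() {
    local IFS=','   # "$*" joins the arguments with the first character of IFS
    echo "$*"
}

RPC_LIST="$(rpc_endpoints 192.168.88.10:50052 192.168.88.11:50052)"
echo "--rpc ${RPC_LIST}"
```

The resulting `${RPC_LIST}` can then be passed as the value of `--rpc` in the `bin/main` invocation shown earlier.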