Hi.
I've noticed a big degradation in inference speed when using the web server vs "just using the library".
Whenever I run my script, I get around 80 tokens/sec. When using the server I get 40 tokens/sec for the exact same prompt, samplers, and LLM config.
OS: Linux
GPU: Nvidia RTX 3090
library versions: I first noticed this on version 0.2.75 and it persists up to the latest. I installed with CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_FORCE_MMQ=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]==<version>, making sure I wasn't using the pip cache. I also tried it with the prebuilt wheels.
I use the same prompt every time, at temperature 0, and make repeated identical calls (in case the problem was with the cache).
POST http://127.0.0.1:8080/v1/completions
{
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\r\n\r\nYou are a helpful AI assistant that rewrites every input as if it were written by a pirate. You are lengthy and thorough, rewriting every bit of the text.<|eot_id|><|start_header_id|>user<|end_header_id|>\r\n\r\n<<<---MY VERY BIG TEXT--->>><|eot_id|><|start_header_id|>assistant<|end_header_id|>\r\n\r\n",
    "max_tokens": 2048,
    "temperature": 0,
    "seed": 123,
    "stream": false
}
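For reference, the server-side tokens/sec number comes from a loop along these lines (a rough sketch, not the exact harness; tokens/sec is computed from the usage field of the response over the wall-clock time of the whole request, so it includes prompt processing):
import time
import requests

URL = "http://127.0.0.1:8080/v1/completions"
payload = {
    "prompt": "<same prompt as above>",
    "max_tokens": 2048,
    "temperature": 0,
    "seed": 123,
    "stream": False,
}

for i in range(5):  # repeated identical calls
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=600)
    elapsed = time.perf_counter() - start
    usage = resp.json()["usage"]
    print(f"run {i}: {usage['completion_tokens'] / elapsed:.1f} tokens/sec")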
My script:
import llama_cpp
PROMPT = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant that rewrites every input as if it were written by a pirate. You are lengthy and thorough, rewriting every bit of the text.<|eot_id|><|start_header_id|>user<|end_header_id|>
<<<---MY VERY BIG TEXT--->>><|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
new_param = {
    "model_path": "../MODELS/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    "n_ctx": 8192,
    "n_threads": 12,
    "n_threads_batch": 12,
    "n_batch": 512,
    "use_mmap": True,
    "use_mlock": False,
    "mul_mat_q": True,
    "numa": False,
    "n_gpu_layers": 33,
    "rope_freq_base": 500000,
    "tensor_split": None,
    "rope_freq_scale": 1.0,
    "offload_kqv": True,
    "split_mode": 1,
    "flash_attn": True,
    "cache": True,
}
llm = llama_cpp.Llama(**new_param)
cache = llama_cpp.LlamaRAMCache(capacity_bytes=2 << 30)
llm.set_cache(cache)
copy_of_server_default_sampler = {
    'suffix': None,
    'max_tokens': 2048,
    'temperature': 0.0,
    'top_p': 0.95,
    'min_p': 0.05,
    'echo': False,
    'stop': None,
    'stream': False,
    'logprobs': None,
    'presence_penalty': 0.0,
    'frequency_penalty': 0.0,
    'logit_bias': None,
    'seed': 123,
    'model': None,
    'top_k': 40,
    'repeat_penalty': 1.1,
    'mirostat_mode': 0,
    'mirostat_tau': 5.0,
    'mirostat_eta': 0.1,
    'grammar': None,
}
def make_completion():
    output = llm(
        PROMPT,
        **copy_of_server_default_sampler,
    )
    print(output)
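For reference, library-side throughput can be measured the same way, from the usage field of the returned dict (a sketch, not my exact code; the elapsed time includes prompt processing, so it's a rough end-to-end number):
import time

start = time.perf_counter()
output = llm(PROMPT, **copy_of_server_default_sampler)
elapsed = time.perf_counter() - start
print(f"{output['usage']['completion_tokens'] / elapsed:.1f} tokens/sec")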
Extra 1
I also timed the call to this function: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L742
(I did it by editing the file directly in my venv and using time.time(). I know that's unreliable, but the results were consistent.)
When running the script each call took around 0.00077 seconds, while with the server it took around 0.00085 seconds. Worse, but not double the time 🤷
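The edit itself boils down to a wrapper like this (a sketch, not the exact change; I just put time.time() calls around the function directly in my venv copy of llama.py, and time.perf_counter() would be the better choice anyway):
import time

def timed(fn):
    # prints how long each call to the wrapped function takes
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.5f}s")
        return result
    return wrapper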
Extra 2
Using nvtop, I also noticed that the script is much harder on my GPU, pushing it to around 90% utilization, while the server only takes it to about 50%.
With all of the above said... is this loss of performance caused by some misconfiguration on my part? If so, how can I improve it?
Or is it something inherent to the implementation of the server? (I could see the multithreading/multi-user implementation having something to do with this.)
Thanks in advance 🙏