How to deploy MPT with llama.cpp? #939
Unanswered
streetycat asked this question in Q&A
Replies: 3 comments 3 replies
-
How are you sending the POST? Can you share your …
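For reference, a completion request can be sent to a running llama.cpp server over plain HTTP. This is a minimal sketch assuming the server's `/completion` endpoint with a JSON body (`prompt`, `n_predict`), as described in llama.cpp's server example; the host and port are assumptions and should match the flags the server was started with.

```python
# Minimal sketch: POST a completion request to a locally running llama.cpp
# server. Field names follow llama.cpp's examples/server documentation;
# the host/port below are assumptions.
import json
import urllib.request


def build_payload(prompt: str, n_predict: int = 128) -> dict:
    """Build the JSON body for the server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}


def post_completion(prompt: str, host: str = "http://127.0.0.1:8080") -> dict:
    """Send the request and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        host + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Requires a llama.cpp server already listening on the assumed port.
    print(post_completion("Building a website can be done in 10 steps:"))
```

If the equivalent `curl` request returns garbage while the bundled demo page works, the difference is usually in how the prompt string itself is constructed, not in the transport.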
-
Thank you, I have found a parameter (…) and updated the command:
The response: … By the way, maybe the …
-
Definitely looks like a prompt format issue …
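To illustrate the prompt-format point: the MPT chat variants were trained with a ChatML-style template, so sending a bare string often produces poor output. The exact template below is an assumption based on the mpt-7b-chat model card; check the card for the authoritative format of the model actually being served.

```python
# Hedged sketch of a ChatML-style prompt wrapper for MPT chat models.
# The tag layout is an assumption taken from the mpt-7b-chat model card.
def format_mpt_chat(user_msg: str,
                    system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap a user message in the ChatML-style tags MPT chat models expect."""
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```

Passing the output of such a formatter as the `prompt` field, instead of the raw user text, is typically what the demo page does behind the scenes.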
-
I found that llama.cpp already supports MPT. I downloaded a GGUF from here, and llama.cpp did load it, but its output looks bad.
I start the server as follows:
And I post the request as follows:
But it works well on the demo page:

The same method worked well with LLaMA, so I don't know what went wrong.