OpenAI compatible web server: chat completion **streaming**: **missing spaces** #1208
The web server is started with a `llama_cpp_config.json` config file.

An OpenAI client is created (`client = openai.OpenAI(...)`) along with a list of `messages`.

**Non-streaming**

`response_chat = client.chat.completions.create(...)` returns a normal `response_chat`, with the expected text.

**Streaming**

```python
response_stream = client.chat.completions.create(...)
response_contents = []
# each streamed chunk looks like:
# ChatCompletionChunk(id='chatcmpl-583c9ff9-6d34', choices=[Choice(delta=ChoiceDelta(content=None, ...)
full_response = ''.join([m for m in response_contents if m is not None])
```

**Issue**

The `full_response` assembled from the streamed chunks is missing the spaces between tokens.

Thanks a lot for any input, comment, suggestion...
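For context, the joining step from the snippet above can be sketched as a tiny standalone function; the sample delta values here are hypothetical, chosen only to illustrate the symptom, not taken from the original report:

```python
# Minimal sketch of the client-side join described above. The delta values
# below are made-up placeholders, not from the original report.
def join_stream_deltas(deltas):
    """Join streamed delta.content values, skipping the None entries
    emitted for role/finish chunks."""
    return "".join([m for m in deltas if m is not None])

# With correct tokenization, each delta carries its own leading space:
good = [None, "Once", " upon", " a", " time", None]
print(join_stream_deltas(good))   # -> Once upon a time

# The symptom reported here: deltas arrive without leading spaces,
# so the joined text runs the words together:
bad = [None, "Once", "upon", "a", "time", None]
print(join_stream_deltas(bad))    # -> Onceuponatime
```

This shows the client-side join itself is not at fault: if the leading spaces are absent from the streamed deltas, no amount of joining can restore them, which points at the server-side tokenizer.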
Replies: 1 comment 1 reply
@didierguillevic there was a bug with the `LlamaHFTokenizer` but it should be fixed in v0.2.50