
Prompt interrupted before continuation for Unicode UTF-8 emojis #63

Closed

loretoparisi opened this issue Mar 12, 2023 · 2 comments

Labels: bug, duplicate, enhancement

@loretoparisi commented Mar 12, 2023

I have found that when the prompt contains a Unicode UTF-8 emoji character like

Unicode Character “👍” (U+1F44D)

the prompt breaks.

I'm reading a sample prompt from a text file:

cat prompt

Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:

Looking at the logs, I can see that the tokenizer does in fact break at the (U+1F44D) char code:

(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
     1 -> ''
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
 29902 -> 'I'
 26277 -> ' hate'
   372 -> ' it'
   746 -> ' when'
   590 -> ' my'
  9008 -> ' phone'
 16988 -> ' battery'
  2977 -> ' dies'
  1213 -> '."'
    13 -> '
'
  2008 -> 'Se'
   593 -> 'nt'
  2073 -> 'iment'
 29901 -> ':'
 12610 -> ' Neg'
  1230 -> 'ative'
    13 -> '
'
  2277 -> '##'
 29937 -> '#'
    13 -> '
'
 27418 -> 'Tw'
  3905 -> 'ee'
 29873 -> 't'
 29901 -> ':'
   376 -> ' "'
  3421 -> 'My'
  2462 -> ' day'
   756 -> ' has'
  1063 -> ' been'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negative—as^C

resulting in a broken input prompt: tokenization stops at the space just before the emoji (36 tokens, ending at 29871 -> ' '), and everything from the emoji onward is silently dropped.
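
For reference, U+1F44D is a four-byte sequence in UTF-8, so a tokenizer that treats the input as single-byte characters can cut the prompt short when it hits the lead byte. A minimal C++ sketch (an illustration only, not code from this repository) that prints those bytes:

#include <cstdio>
#include <string>

int main() {
    // "👍" (U+1F44D) encodes to four bytes in UTF-8.
    const std::string thumbs_up = "\xF0\x9F\x91\x8D";
    for (unsigned char c : thumbs_up) {
        std::printf("0x%02X ", c);  // prints: 0xF0 0x9F 0x91 0x8D
    }
    std::printf("\n");
    return 0;
}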

@beiller (Contributor) commented Mar 13, 2023

It's unable to support emojis without the Unicode support fix. I have a branch, or we can wait until more work is done here.

PR: #66

Branch: https://github.com/beiller/llama.cpp/tree/feature/tokenization
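
For context, a Unicode-aware tokenizer has to treat a multi-byte UTF-8 sequence as one unit rather than as independent bytes. A minimal C++ sketch of the general lead-byte classification technique (an illustration under that assumption, not the code in the branch or PR):

#include <cstddef>

// Return the length in bytes (1-4) of the UTF-8 sequence that starts
// with the given lead byte, or 0 for a continuation/invalid byte.
static std::size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

With something like this, a scanner can step through the prompt code point by code point, so the four bytes of "👍" stay together instead of being misread as a truncated character.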

@gjmulder added the bug, duplicate, and enhancement labels and removed the duplicate label on Mar 15, 2023
@sw (Contributor) commented Apr 1, 2023

I believe this was fixed by #79

@sw closed this as completed Apr 1, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
Low level chat: Added iterative search to prevent instructions from being echoed