
Unable to convert Smaug 72B #5807
Closed

schmorp opened this issue Mar 1, 2024 · 11 comments

schmorp commented Mar 1, 2024

I am unable to convert https://huggingface.co/abacusai/Smaug-72B-v0.1 (and others) to GGUF with either convert.py or convert-hf-to-gguf.py.

With the former, I get:

RuntimeError: Internal: ./src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

"internal" feels like a bug. When I add --vocab-type hfft (and then --pad-vocab because it tells me to), I get a nonfunctional model:

llm_load_vocab: SPM vocabulary, but newline token not found: unordered_map::at! Using special_pad_id instead.
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 214/152064 ).

and convert-hf-to-gguf.py does not support "LlamaForCausalLM".
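
For reference, roughly the invocations in question (output file names below are placeholders, not the exact command lines used):

python convert.py ./Smaug-72B-v0.1 --outfile ./Smaug-72B-v0.1.gguf
python convert.py ./Smaug-72B-v0.1 --vocab-type hfft --pad-vocab --outfile ./Smaug-72B-v0.1.gguf
python convert-hf-to-gguf.py ./Smaug-72B-v0.1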

schmorp commented Mar 1, 2024

llama.cpp release b2291 btw.

dranger003 (Contributor) commented

I think it works by changing this line:
https://github.com/ggerganov/llama.cpp/blob/e7433867288d2f142cffe596f3751bda5d7ee2c7/convert-hf-to-gguf.py#L262
to:

if arch in ("MixtralForCausalLM", "LlamaForCausalLM"):
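
For context, the architecture name the converter dispatches on comes from the model's config.json; a quick, purely illustrative way to check it (the path is a placeholder):

import json

# Hypothetical check: print the architecture list from the model's config.json;
# for Smaug-72B-v0.1 this is expected to include "LlamaForCausalLM".
with open("./Smaug-72B-v0.1/config.json") as f:
    print(json.load(f)["architectures"])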


schmorp commented Mar 2, 2024

Thanks a lot, but that just fails further down the line:

Error: Missing Smaug-72B-v0.1/tokenizer.model

I notice that quite a lot of models on Hugging Face were apparently convertible a few weeks or months ago but no longer are (e.g. TheBloke has GGUFs for them, but when I try converting with current versions of llama.cpp to make imatrix quants, they fail in lots of different ways). Is this considered a regression, and should issues be created for those?


dranger003 commented Mar 2, 2024

Ah yes, I think I recreated it by loading the model with HF transformers and calling save_pretrained(), although I also think this is just about to get fixed, looking at PR #5821.
You could also use the tokenizer model from the model it was trained from, if that is available.
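
A minimal sketch of the save_pretrained() approach (repo id and local path are placeholders, and whether a tokenizer.model file actually gets written depends on the tokenizer type):

from transformers import AutoTokenizer

# Hypothetical: re-save the tokenizer files into the local checkpoint directory
# so convert.py can pick them up; trust_remote_code may or may not be needed.
tok = AutoTokenizer.from_pretrained("abacusai/Smaug-72B-v0.1", trust_remote_code=True)
tok.save_pretrained("./Smaug-72B-v0.1")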


schmorp commented Mar 5, 2024

Most of the llama-2-derived models seem to have this vocabulary mismatch problem, and #5821 does not appear to help.

christiandaley commented

Any updates or workarounds on this? I just tried to convert Smaug 72B, and when I try to run it with llama.cpp I get: error loading model: _map_base::at


countzero commented Mar 12, 2024

@schmorp, @dranger003 & @christiandaley

I am using the latest version of https://huggingface.co/abacusai/Smaug-72B-v0.1 and llama.cpp release b2405.

Solution:

Smaug-72B-v0.1 is based on a model that uses a Byte Pair Encoding (BPE) vocabulary type. The conversion can be done with:

python convert.py --vocab-type "bpe" --pad-vocab --outfile ./models/Smaug-72B-v0.1.gguf /path/to/repository
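
(The command above produces an unquantized GGUF; the Q5_K_M file in the server log below implies an additional quantization step, shown here with illustrative file names.)

./quantize ./models/Smaug-72B-v0.1.gguf ./models/Smaug-72B-v0.1.Q5_K_M.gguf Q5_K_M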

Server Log:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
{"build":2405,"commit":"5cdb3717","function":"main","level":"INFO","line":2732,"msg":"build info","tid":"85172","timestamp":1710235396}
{"function":"main","level":"INFO","line":2739,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"85172","timestamp":1710235396,"total_threads":32}
llama_model_loader: loaded meta data with 22 key-value pairs and 1043 tensors from .\vendor\llama.cpp\models\Smaug-72B-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = R:\AI\LLM\source
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 80
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 24576
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 64
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,152064] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 481 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 8192
llm_load_print_meta: n_embd_v_gqa = 8192
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attm = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 65B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 72.29 B
llm_load_print_meta: model size = 47.78 GiB (5.68 BPW)
llm_load_print_meta: general.name = R:\AI\LLM\source
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.40 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors: CPU buffer size = 48926.31 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 20480.00 MiB
llama_new_context_with_model: KV self size = 20480.00 MiB, K (f16): 10240.00 MiB, V (f16): 10240.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 33.07 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1088.00 MiB
llama_new_context_with_model: graph splits (measure): 1
[1710235412] warming up the model with an empty run

Open Question:

Is the warning something to worry about?

llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).


schmorp commented Mar 12, 2024

I thought I had already tried that, but maybe I hadn't. Thanks a lot for this tip!

As for the special token definition warning, I have seen it with a few other models and they seemed to work, but YMMV.


schmorp commented Mar 12, 2024

It works, so this was user error. Sorry for the noise.

schmorp closed this as completed Mar 12, 2024

schmorp commented Mar 17, 2024

When converting with the command from @countzero, the resulting model crashes main/imatrix with:

terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Aborted

Sorry, @christiandaley, you actually reported this a week ago and I overlooked it.

schmorp reopened this Mar 17, 2024
github-actions bot added the stale label Apr 17, 2024

github-actions bot commented May 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed May 1, 2024