Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

saket424 · 2024-08-17T17:14:50Z

What happened?

export LLAMA_CUDA=1 # only if for NViDiA CUDA
export CUDA_DOCKER_ARCH=compute_86
make -j$(nproc) NVCC=/usr/local/cuda/bin/nvcc

./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048

core dump

before this commit no crash

Since minicpm2.6 has a completely separate cli, i did not expect it to affect llama-llava-cli which moondream uses

Crash only observed on linux cuda and not on Mac

Name and Version

Yes crash with version 3598

No crash with
./llama-cli --version
version: 3597 (ee2984b)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

anand@nitro17:~/moondream-stuff/llama.cpp$ ./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf  --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from ./m2/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from ./m2/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 294
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in   167.45 ms by CLIP (    0.23 ms per image patch)

 The image shows a computer server rack with multiple computer boards and components on it. The rack is placed on a carpeted floor, and there is a chair nearby. The computer boards are connected to the rack using wires, and the rack is positioned in a room with a brick wall in the background.

llama_print_timings:        load time =    1776.52 ms
llama_print_timings:      sample time =       1.56 ms /    61 runs   (    0.03 ms per token, 39203.08 tokens per second)
llama_print_timings: prompt eval time =     963.07 ms /   770 tokens (    1.25 ms per token,   799.52 tokens per second)
llama_print_timings:        eval time =    3473.04 ms /    60 runs   (   57.88 ms per token,    17.28 tokens per second)
llama_print_timings:       total time =    5310.63 ms /   830 tokens
anand@nitro17:~/moondream-stuff/llama.cpp$



anand@nitro17:~/moondream-stuff/llama.cpp$ ./llama-llava-cli -m ./m2/moondream2-text-model-f16.gguf --mmproj ./m2/moondream2-mmproj-f16.gguf  --image ./assets/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from ./m2/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from ./m2/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 294
Segmentation fault (core dumped)

The text was updated successfully, but these errors were encountered:

arch-btw · 2024-08-17T22:34:54Z

version: 3597 (ee2984b)

I think that was the commit before MiniCPM-V-2.6 got merged. So it might be something else.

saket424 · 2024-08-17T22:45:56Z

version 3597 works and version 3598 bombs. i narrowed it down. it should be easy enough for someone to reproduce this

LostRuins · 2024-08-18T16:10:31Z

Can confirm it's broken for llava. It seems to work intermittently, probably some out of bounds memory access.

fairydreaming · 2024-08-18T16:13:44Z

It crashes in this assert located in GGML get rows operation:

(gdb) 
#7  0x0000555555632eb2 in ggml_compute_forward_get_rows_f32 (params=0x7ffedd080ce0, dst=0x555555ce80b0)
    at ggml/src/ggml.c:13345
13345	        assert(i01 >= 0 && i01 < ne01);
(gdb) print i01
$1 = 729
(gdb) print ne01
$2 = 729

The direct cause is that the index in get rows operation is outside the valid range. I noticed that dst->src[1] is named patches, so I think it's the one created here:

https://github.com/ggerganov/llama.cpp/blob/554b049068de24201d19dde2fa83e35389d4585d/examples/llava/clip.cpp#L2418-L2426

Note the i + 1 in this loop. I made the following change:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..224db9b5 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2419,7 +2419,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
             struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
             int* patches_data = (int*)malloc(ggml_nbytes(patches));
             for (int i = 0; i < num_patches; i++) {
-                patches_data[i] = i + 1;
+                patches_data[i] = i;
             }
             ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
             free(patches_data);

And it no longer crashes:

(base) phm@epyc:~/projects/llama.cpp$ ./llama-llava-cli --numa distribute -t 32 -m /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf --mmproj /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf --image ~/Downloads/demo-2.jpg -p "describe the image" --temp 0.1 -c 2048
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-text-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  2706.27 MiB
................................................................................
clip_model_load: model name:   vikhyatk/moondream2
clip_model_load: description:  image encoder for vikhyatk/moondream2
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    457
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 457 tensors from /mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = vikhyatk/moondream2
clip_model_load: - kv   6:                        general.description str              = image encoder for vikhyatk/moondream2
clip_model_load: - kv   7:                        clip.projector_type str              = mlp
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 378
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 2048
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 28
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  172 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     867.61 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  867.61 MB (457 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 50.10 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.20 MiB
llama_new_context_with_model:        CPU compute buffer size =   160.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 1
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in   422.27 ms by CLIP (    0.58 ms per image patch)

 The image shows a computer cooling rack with several computer parts on it. The rack is placed on a carpeted floor, and there is a couch in the background. The computer parts include a large black computer tower, multiple computer fans, and various other components. The rack is filled with these parts, indicating that it is likely being used for assembling or disassembling computer systems.

llama_print_timings:        load time =    4348.98 ms
llama_print_timings:      sample time =       2.08 ms /    77 runs   (    0.03 ms per token, 36930.46 tokens per second)
llama_print_timings: prompt eval time =    3340.20 ms /   770 tokens (    4.34 ms per token,   230.53 tokens per second)
llama_print_timings:        eval time =    1033.80 ms /    76 runs   (   13.60 ms per token,    73.51 tokens per second)
llama_print_timings:       total time =    5400.74 ms /   846 tokens

I guess the question remains why it worked before and now it doesn't? I have no idea yet :/

LostRuins · 2024-08-18T16:30:07Z

~~That does seem to fix it, although I can't be sure. On first glance llava 1.5 no longer crashes.~~

The crash was very inconsistent, probably because sometimes this off-by-one access wasn't actually out of bounds memory (maybe due to padding?).

It was extra weird because adding a simple printf before calls to clip_is_minicpmv would prevent it from crashing as well. I suspect this issue was already present for quite some time.

LostRuins · 2024-08-18T16:32:37Z

Edit: nope, I think this does not solve the issue. I am still getting intermittent segfaults

fairydreaming · 2024-08-18T16:39:58Z

@LostRuins I just tried release builds, in my case only the debug builds (LLAMA_DEBUG=1) crashed on this assert, release build worked without problems. So this may be an entirely unrelated problem after all. I can't reproduce crashes in release builds.

LostRuins · 2024-08-18T16:42:21Z

I'm not hitting any assert. I am getting a segmentation fault

exception: access violation reading 0x0000657669736E65

Adding the abovementioned print statements before every call to clip_is_minicpmv (temporarily) resolves this, but that's not a proper solution - there's definitely still some out of bounds access going on.

fairydreaming · 2024-08-18T17:00:06Z

It finally crashed. I guess the important part is LLAMA_CUDA=1.

slaren · 2024-08-18T17:14:44Z

I guess the question remains why it worked before and now it doesn't? I have no idea yet :/

This assert was added fairly recently (in #6210), so previously this wouldn't be noticed even in debug builds. It would cause wrong data to be returned, but since more tensors are allocated in the same buffer, it is not likely to cause it to crash with an invalid access. It looks like a logic error in the clip implementation, and it may have affected the quality of the generation.

fairydreaming · 2024-08-18T17:22:21Z

Try this:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..8ce4add1 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -1108,7 +1108,7 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
         }
     }
 
-    clip_ctx * new_clip = new clip_ctx;
+    clip_ctx * new_clip = new clip_ctx{};
 
     // update projector type
     {

I noticed that the default constructor of clip_ctx didn't initialize the fields, so they were basically all filled with garbage:

Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111	    clip_ctx * new_clip = new clip_ctx;
(gdb) n
1115	        int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$10 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false, 
  minicpmv_version = 2, vision_model = {hparams = {image_size = -283352320, patch_size = 32767, hidden_size = 1664890336, 
      n_intermediate = 21845, projection_dim = 1664890336, n_head = 21845, n_layer = 779485184, eps = 5.3529077e-11, 
      mm_patch_merge_type = "flat", '\000' <repeats 27 times>, image_grid_pinpoints = {0, 1, 854052736, 0, 20, 0, 1818373750, 
        909258347, 1953784110, 779509614, 1935763810, 1, 1152, 0, 0, 856706944, 0, 24, 0, 1818373750, 909258347, 1953784110, 
        1970233198, 1702309492, 1952999273, 2, 1152, 0, 1152, 0, 1, 856711552}, image_crop_resolution = 0}, 
    class_embedding = 0x6c622e7600000000, patch_embeddings = 0x7474612e36322e6b, patch_bias = 0x69622e74756f5f6e, 
    position_embeddings = 0x480000000017361, pre_ln_w = 0x0, pre_ln_b = 0x3338e1800000, 
    layers = std::vector of length 0, capacity 0, post_ln_w = 0x17468676965, post_ln_b = 0x48000, 
    projection = 0x38f3800000000000, mm_0_w = 0x0, mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0, 
    mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0, mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x4800000, 
    mm_model_mlp_1_b = 0x10d00000, mm_model_mlp_3_w = 0x1780000000010000, mm_model_mlp_3_b = 0x16000000003339, 
    mm_model_block_1_block_0_0_w = 0x2e76000000000000, mm_model_block_1_block_0_1_w = 0x662e36322e6b6c62, 
    mm_model_block_1_block_0_1_b = 0x2e6e776f645f6e66, mm_model_block_1_block_1_fc1_w = 0x173616962, 
    mm_model_block_1_block_1_fc1_b = 0x10d0, mm_model_block_1_block_1_fc2_w = 0x33d0678000000000, 
    mm_model_block_1_block_1_fc2_b = 0x1600000000, mm_model_block_1_block_2_0_w = 0x6c622e7600000000, 
    mm_model_block_1_block_2_1_w = 0x6e66662e36322e6b, mm_model_block_1_block_2_1_b = 0x676965772e70755f, 
    mm_model_block_2_block_0_0_w = 0x10d0000000027468, mm_model_block_2_block_0_1_w = 0x480000000000000, 
    mm_model_block_2_block_0_1_b = 0x1000000000000, mm_model_block_2_block_1_fc1_w = 0x33d0aac00000, 
    mm_model_block_2_block_1_fc1_b = 0x140000, mm_model_block_2_block_1_fc2_w = 0x2e6b6c622e760000, 
    mm_model_block_2_block_1_fc2_b = 0x755f6e66662e3632, mm_model_block_2_block_2_0_w = 0x1736169622e70, 
    mm_model_block_2_block_2_1_w = 0x4800000, mm_model_block_2_block_2_1_b = 0xfac0000000000000, 
    mm_model_mlp_0_w = 0x13000000003467, mm_model_mlp_0_b = 0x2e76000000000000, mm_model_mlp_2_w = 0x6c2e36322e6b6c62, 
    mm_model_mlp_2_b = 0x68676965772e326e, mm_model_peg_0_w = 0x4800000000174, mm_model_peg_0_b = 0x0, 
    mm_model_pos_embed_k = 0x34680cc000, mm_model_query = 0x1100, mm_model_proj = 0x322e6b6c622e7600, 
    mm_model_kv_proj = 0x69622e326e6c2e36, mm_model_attn_q_w = 0x480000000017361, mm_model_attn_q_b = 0x0, 
    mm_model_attn_k_w = 0x34681ec00000, mm_model_attn_k_b = 0x160000, mm_model_attn_v_w = 0x2e6b6c622e760000, 
    mm_model_attn_v_b = 0x5f6e7474612e3732, mm_model_attn_o_w = 0x7468676965772e71, mm_model_attn_o_b = 0x48000000002, 
    mm_model_ln_q_w = 0x48000000000, mm_model_ln_q_b = 0x100000000, mm_model_ln_kv_w = 0x346830c0, mm_model_ln_kv_b = 0x14, 
    mm_model_ln_post_w = 0x37322e6b6c622e76, mm_model_ln_post_b = 0x2e715f6e7474612e}, proj_type = PROJECTOR_TYPE_MLP, 
  image_mean = {1.40129846e-45, 1.61429583e-42, 0}, image_std = {0, 2.69506927e-07, 0}, use_gelu = false, ftype = 1, 
--Type <RET> for more, q to quit, c to continue without paging--
  has_class_embedding = true, has_pre_norm = true, has_post_norm = false, has_patch_bias = false, 
  ctx_gguf = 0x7474612e37322e6b, ctx_data = 0x676965772e6b5f6e, buf_compute_meta = std::vector of length 0, capacity 0, 
  params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x5f6e7474612e3732}

After the change:

Thread 1 "llama-llava-cli" hit Breakpoint 6, clip_model_load (fname=0x55556334d210 "/mnt/md0/huggingface/hub/models--vikhyatk--moondream2/snapshots/41cf7c96a95dea9dccd03bb8d50e03103e5293f3/moondream2-mmproj-f16.gguf", verbosity=1) at examples/llava/clip.cpp:1111
1111	    clip_ctx * new_clip = new clip_ctx{};
(gdb) n
1115	        int idx = gguf_find_key(ctx, KEY_PROJ_TYPE);
(gdb) print *new_clip
$11 = {has_text_encoder = false, has_vision_encoder = false, has_llava_projector = false, has_minicpmv_projector = false, 
  minicpmv_version = 2, vision_model = {hparams = {image_size = 0, patch_size = 0, hidden_size = 0, n_intermediate = 0, 
      projection_dim = 0, n_head = 0, n_layer = 0, eps = 0, mm_patch_merge_type = "flat", '\000' <repeats 27 times>, 
      image_grid_pinpoints = {0 <repeats 32 times>}, image_crop_resolution = 0}, class_embedding = 0x0, 
    patch_embeddings = 0x0, patch_bias = 0x0, position_embeddings = 0x0, pre_ln_w = 0x0, pre_ln_b = 0x0, 
    layers = std::vector of length 0, capacity 0, post_ln_w = 0x0, post_ln_b = 0x0, projection = 0x0, mm_0_w = 0x0, 
    mm_0_b = 0x0, mm_2_w = 0x0, mm_2_b = 0x0, image_newline = 0x0, mm_1_w = 0x0, mm_1_b = 0x0, mm_3_w = 0x0, mm_3_b = 0x0, 
    mm_4_w = 0x0, mm_4_b = 0x0, mm_model_mlp_1_w = 0x0, mm_model_mlp_1_b = 0x0, mm_model_mlp_3_w = 0x0, 
    mm_model_mlp_3_b = 0x0, mm_model_block_1_block_0_0_w = 0x0, mm_model_block_1_block_0_1_w = 0x0, 
    mm_model_block_1_block_0_1_b = 0x0, mm_model_block_1_block_1_fc1_w = 0x0, mm_model_block_1_block_1_fc1_b = 0x0, 
    mm_model_block_1_block_1_fc2_w = 0x0, mm_model_block_1_block_1_fc2_b = 0x0, mm_model_block_1_block_2_0_w = 0x0, 
    mm_model_block_1_block_2_1_w = 0x0, mm_model_block_1_block_2_1_b = 0x0, mm_model_block_2_block_0_0_w = 0x0, 
    mm_model_block_2_block_0_1_w = 0x0, mm_model_block_2_block_0_1_b = 0x0, mm_model_block_2_block_1_fc1_w = 0x0, 
    mm_model_block_2_block_1_fc1_b = 0x0, mm_model_block_2_block_1_fc2_w = 0x0, mm_model_block_2_block_1_fc2_b = 0x0, 
    mm_model_block_2_block_2_0_w = 0x0, mm_model_block_2_block_2_1_w = 0x0, mm_model_block_2_block_2_1_b = 0x0, 
    mm_model_mlp_0_w = 0x0, mm_model_mlp_0_b = 0x0, mm_model_mlp_2_w = 0x0, mm_model_mlp_2_b = 0x0, mm_model_peg_0_w = 0x0, 
    mm_model_peg_0_b = 0x0, mm_model_pos_embed_k = 0x0, mm_model_query = 0x0, mm_model_proj = 0x0, mm_model_kv_proj = 0x0, 
    mm_model_attn_q_w = 0x0, mm_model_attn_q_b = 0x0, mm_model_attn_k_w = 0x0, mm_model_attn_k_b = 0x0, 
    mm_model_attn_v_w = 0x0, mm_model_attn_v_b = 0x0, mm_model_attn_o_w = 0x0, mm_model_attn_o_b = 0x0, 
    mm_model_ln_q_w = 0x0, mm_model_ln_q_b = 0x0, mm_model_ln_kv_w = 0x0, mm_model_ln_kv_b = 0x0, mm_model_ln_post_w = 0x0, 
    mm_model_ln_post_b = 0x0}, proj_type = PROJECTOR_TYPE_MLP, image_mean = {0, 0, 0}, image_std = {0, 0, 0}, 
  use_gelu = false, ftype = 1, has_class_embedding = true, has_pre_norm = true, has_post_norm = false, 
  has_patch_bias = false, ctx_gguf = 0x0, ctx_data = 0x0, buf_compute_meta = std::vector of length 0, capacity 0, 
  params_buffer = 0x0, backend = 0x0, compute_alloc = 0x0, load_image_size = 0x0}

saket424 · 2024-08-18T18:18:09Z

@fairydreaming
That one line change fixed it

saket424 · 2024-08-18T18:56:52Z

@fairydreaming
Not specifically related to your fix, I just noticed it is not offloading any layers to the GPU. Is this normal?

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors: CPU buffer size = 2706.27 MiB

fairydreaming · 2024-08-18T19:23:05Z

@saket424 yeah, I didn't use -ngl option, so it didn't offload any layers.

fairydreaming · 2024-08-19T08:24:21Z

@monatis can you take a look at this code:

https://github.com/ggerganov/llama.cpp/blob/1b6ff90ff8301d9fe2027be2bb9fea26177d775e/examples/llava/clip.cpp#L2418-L2426

I think it's a rewritten form of your original llava 1.5 code:

        struct ggml_tensor * patches = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, num_patches);
        ggml_allocr_alloc(ctx->alloc, patches);
        if (!ggml_allocr_is_measure(ctx->alloc)) {
            for (int i = 0; i < num_patches; ++i) {
                ggml_set_i32_1d(patches, i, i+1);
            }
        }

Do you remember what is the purpose of i + 1? Is it related to vision feature select strategy? I found the following in transformers library:

https://github.com/huggingface/transformers/blob/52cb4034ada381fe1ffe8d428a1076e5411a8026/src/transformers/models/llava/modeling_llava.py#L450-L456

(note selected_image_feature[:, 1:] when vision_feature_select_strategy is default)

Since i increases from 0 to num_patches - 1, i + 1 will have value num_patches at the end that is outside the valid range of embeddings tensor dimension and causes assertion failure in GGML get rows operation.

saket424 added bug-unconfirmed critical severity Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss) labels Aug 17, 2024

saket424 mentioned this issue Aug 17, 2024

support MiniCPM-V-2.6 #8967

Merged

fairydreaming mentioned this issue Aug 18, 2024

llava : zero-initialize clip_ctx structure fields with aggregate initialization #9082

Merged

4 tasks

fairydreaming mentioned this issue Aug 20, 2024

llava : fix occasional undefined behavior crash #9078

Open

4 tasks

fairydreaming linked a pull request Aug 20, 2024 that will close this issue

llava : fix occasional undefined behavior crash #9078

Open

4 tasks

fairydreaming closed this as completed in #9082 Aug 21, 2024

slaren mentioned this issue Sep 12, 2024

llama : refactor sampling v2 #9294

Merged

4 tasks

This was referenced Sep 12, 2024

chore(deps): update llama.cpp mudler/LocalAI#3497

Merged

Bug: loading llava models fails #9455

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

saket424 commented Aug 17, 2024

arch-btw commented Aug 17, 2024

saket424 commented Aug 17, 2024 •

edited

Loading

LostRuins commented Aug 18, 2024

fairydreaming commented Aug 18, 2024

LostRuins commented Aug 18, 2024 •

edited

Loading

LostRuins commented Aug 18, 2024 •

edited

Loading

fairydreaming commented Aug 18, 2024

LostRuins commented Aug 18, 2024 •

edited

Loading

fairydreaming commented Aug 18, 2024

slaren commented Aug 18, 2024

fairydreaming commented Aug 18, 2024 •

edited

Loading

saket424 commented Aug 18, 2024

saket424 commented Aug 18, 2024

fairydreaming commented Aug 18, 2024

fairydreaming commented Aug 19, 2024 •

edited

Loading

Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

Bug: MiniCPM-V-2.6 commit d565bb2fd5a2a58b9924a7a34e77a87c78c52137 causing crash in moondream #9066

Comments

saket424 commented Aug 17, 2024

What happened?

Name and Version

What operating system are you seeing the problem on?

Relevant log output

arch-btw commented Aug 17, 2024

saket424 commented Aug 17, 2024 • edited Loading

LostRuins commented Aug 18, 2024

fairydreaming commented Aug 18, 2024

LostRuins commented Aug 18, 2024 • edited Loading

LostRuins commented Aug 18, 2024 • edited Loading

fairydreaming commented Aug 18, 2024

LostRuins commented Aug 18, 2024 • edited Loading

fairydreaming commented Aug 18, 2024

slaren commented Aug 18, 2024

fairydreaming commented Aug 18, 2024 • edited Loading

saket424 commented Aug 18, 2024

saket424 commented Aug 18, 2024

fairydreaming commented Aug 18, 2024

fairydreaming commented Aug 19, 2024 • edited Loading

saket424 commented Aug 17, 2024 •

edited

Loading

LostRuins commented Aug 18, 2024 •

edited

Loading

LostRuins commented Aug 18, 2024 •

edited

Loading

LostRuins commented Aug 18, 2024 •

edited

Loading

fairydreaming commented Aug 18, 2024 •

edited

Loading

fairydreaming commented Aug 19, 2024 •

edited

Loading