Description
Name and Version
llama-mtmd-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 5317 (f05a6d7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0
Operating systems
Mac
GGML backends
Metal
Hardware
MacBook M3 Pro 36GB
Models
Qwen2.5-VL-7B-Instruct (Q4_K_M)
Problem description & steps to reproduce
I noticed that Qwen2.5-VL sometimes appears to "miss", or partially "miss", the image for certain prompts.
To reproduce this in mtmd-cli.cpp with minimal changes, I hardcoded the formatted prompt as follows:
- LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());
-
- // text.text = formatted_chat.prompt.c_str();
+ text.text = R"(<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
I'm going to tell you a story:
STORY START
The rain hammered against the windows of the antique shop, mirroring the drumming in Leo’s chest. He'd been searching for weeks, driven by a faded photograph – his grandfather, Silas, holding a peculiar silver compass. Silas had vanished without a trace fifty years ago, leaving only this enigmatic object and a whispered legend about hidden treasures.
Old Mr. Finch, the shop owner, a man who smelled perpetually of dust and beeswax, finally pointed to a small, locked box tucked away in the darkest corner. “Silas brought this in,” he rasped, his voice like rustling parchment. "Said it held the key."
Leo bought the box, his hands trembling as he wrestled with the stubborn lock. Finally, it sprung open, revealing not gold or jewels, but the silver compass. As Leo picked it up, a tiny inscription on its base caught his eye: “Follow your heart.”
Suddenly, a faint scent of pine needles and saltwater filled the air, and Leo knew, instinctively, that Silas hadn’t vanished – he'd simply been waiting for someone to understand.
STORY START
Now, ignore that story and tell me what this image is of<__image__><|im_end|>
<|im_start|>assistant
)";
+ LOG_ERR("formatted_chat.prompt: %s\n", text.text);
I then ran the following command:
llama-mtmd-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --temp 0 --image /Users/matt/Workspace/pikachu.png --prompt "unused dummy prompt"
The model responds as if no image had been provided:
encoding image or slice...
image/slice encoded in 3214 ms
decoding image batch 1/1, n_tokens_batch = 289
image decoded (batch 1/1) in 1157 ms
I'm sorry, but you haven't provided an image for me to describe. If you have an image you'd like me to describe or analyze, please upload it or describe it, and I'll be happy to help!
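As a sanity check that the image did reach the decoder: the 289 image tokens reported above equal 17 × 17. That would be consistent with a square input in which each token covers a 28 × 28 pixel cell, i.e. patch_size 14 (as printed by load_hparams in the full log) combined with 2 × 2 patch merging. The merge factor and the 476 × 476 image size are assumptions for illustration, not values taken from the log, and n_image_tokens is a hypothetical helper rather than a llama.cpp function.

```cpp
// Hypothetical helper (not part of llama.cpp): number of image tokens for an
// encoder that cuts the image into patch_size x patch_size patches and then
// merges merge x merge neighbouring patches into one token.
// Assumption: width and height are already multiples of patch_size * merge.
constexpr int n_image_tokens(int width, int height, int patch_size, int merge) {
    return (width / (patch_size * merge)) * (height / (patch_size * merge));
}

// 476 x 476 is an assumed post-resize size; merge = 2 is also an assumption.
// 476 / 28 = 17, so this gives 17 * 17 = 289, matching n_tokens_batch above.
static_assert(n_image_tokens(476, 476, 14, 2) == 289, "expected 17 * 17 tokens");
```

If the arithmetic holds for the actual image, the image embeddings were produced and decoded, which suggests the problem is in how the model attends to them rather than in the image being dropped.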
Gemma 3 4B (https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF) does not have this issue and responds with:
encoding image or slice...
image/slice encoded in 20257 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 472 ms
That image is of Pikachu, a popular Pokémon character from the Pokémon franchise! He's known for his yellow fur, red cheeks, and electric powers.
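To narrow this down further, it may help to test whether the position of the <__image__> marker relative to the long text matters. The sketch below builds the same ChatML-style prompt with the marker either before or after the user text. build_chatml_prompt is a hypothetical helper, not something that exists in mtmd-cli.cpp, and I have not verified that moving the marker changes the result.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of mtmd-cli.cpp): builds a ChatML-style prompt
// like the hardcoded one above, placing the image marker either before or
// after the user text.
static std::string build_chatml_prompt(const std::string & system,
                                       const std::string & user_text,
                                       const std::string & marker,
                                       bool marker_first) {
    assert(!marker.empty() && "an image marker is required");
    const std::string user = marker_first ? marker + user_text
                                          : user_text + marker;
    return "<|im_start|>system\n" + system + "<|im_end|>\n"
           "<|im_start|>user\n" + user + "<|im_end|>\n"
           "<|im_start|>assistant\n";
}
```

The result could be assigned the same way as in the patch above (text.text = prompt.c_str();), provided the std::string outlives every use of text.text.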
First Bad Commit
No response
Relevant log output
/Users/matt/Workspace/llama.cpp/cmake-build-debug/bin/llama-mtmd-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --temp 0 --image /Users/matt/Workspace/pikachu.png --prompt dummy
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
build: 5317 (f05a6d71) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0 (debug)
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 VL 7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-VL
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: qwen2vl.block_count u32 = 28
llama_model_loader: - kv 7: qwen2vl.context_length u32 = 128000
llama_model_loader: - kv 8: qwen2vl.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2vl.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2vl.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2vl.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% set image_count = namespace(value=...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.36 GiB (4.91 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = Qwen2.5 VL 7B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: Metal_Mapped model buffer size = 4460.45 MiB
load_tensors: CPU_Mapped model buffer size = 292.36 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
llama_kv_cache_unified: Metal KV buffer size = 224.00 MiB
llama_kv_cache_unified: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: Metal compute buffer size = 304.00 MiB
llama_context: CPU compute buffer size = 16.01 MiB
llama_context: graph nodes = 1042
llama_context: graph splits = 114
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
clip_ctx: CLIP using Metal backend
clip_model_loader: model name: Qwen2.5 VL 7B Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 519
clip_model_loader: n_kv: 22
load_hparams: projector: qwen2.5vl_merger
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 3420
load_hparams: n_layer: 32
load_hparams: projection_dim: 3584
load_hparams: image_size: 560
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: ffn_op: silu
load_hparams: model size: 1291.40 MiB
load_hparams: metadata size: 0.18 MiB
alloc_compute_meta: Metal compute buffer size = 200.86 MiB
alloc_compute_meta: CPU compute buffer size = 29.01 MiB
main: loading model: /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
formatted_chat.prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
I'm going to tell you a story:
STORY START
The rain hammered against the windows of the antique shop, mirroring the drumming in Leo’s chest. He'd been searching for weeks, driven by a faded photograph – his grandfather, Silas, holding a peculiar silver compass. Silas had vanished without a trace fifty years ago, leaving only this enigmatic object and a whispered legend about hidden treasures.
Old Mr. Finch, the shop owner, a man who smelled perpetually of dust and beeswax, finally pointed to a small, locked box tucked away in the darkest corner. “Silas brought this in,” he rasped, his voice like rustling parchment. "Said it held the key."
Leo bought the box, his hands trembling as he wrestled with the stubborn lock. Finally, it sprung open, revealing not gold or jewels, but the silver compass. As Leo picked it up, a tiny inscription on its base caught his eye: “Follow your heart.”
Suddenly, a faint scent of pine needles and saltwater filled the air, and Leo knew, instinctively, that Silas hadn’t vanished – he'd simply been waiting for someone to understand.
STORY START
Now, ignore that story and tell me what this image is of<__image__><|im_end|>
<|im_start|>assistant
encoding image or slice...
image/slice encoded in 3122 ms
decoding image batch 1/1, n_tokens_batch = 289
image decoded (batch 1/1) in 1056 ms
I'm sorry, but you haven't provided an image for me to describe. If you have an image you'd like me to describe or analyze, please upload it or describe it, and I'll be happy to help!
llama_perf_context_print: load time = 19621.77 ms
llama_perf_context_print: prompt eval time = 5298.57 ms / 568 tokens ( 9.33 ms per token, 107.20 tokens per second)
llama_perf_context_print: eval time = 2360.08 ms / 45 runs ( 52.45 ms per token, 19.07 tokens per second)
llama_perf_context_print: total time = 11014.97 ms / 613 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
Process finished with exit code 0