Description
Name and Version
version: 4391 (9ba399d)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.1.0
Operating systems
Mac (M4 Max / 128 GB)
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
```sh
./build/bin/llama-server \
  -m /Users/mattsinalco/.cache/huggingface/hub/models--unsloth--Llama-3.3-70B-Instruct-GGUF/snapshots/0c14ebbedd129fb190c8241facca9a360e81c650/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -md /Users/mattsinalco/.cache/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct-GGUF/snapshots/a5594fb18df5dfc6b43281423fcce6750cd92de5/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -ngl 99 -ngld 99 -fa --port 8034 \
  --ctx-size 8192 --ctx-size-draft 8192 \
  --draft-min 0 --draft-max 16 -np 7 \
  --host 0.0.0.0 --slots \
  --slot-save-path /Users/mattsinalco/mathias/caching \
  -ctk q4_1 -ctv q4_1
```
This sometimes, but reproducibly, fails with:
```
/Users/mattsinalco/mathias/llama.cpp/ggml/src/ggml-metal/ggml-metal.m:1263: unsupported op
ggml_metal_encode_node: error: unsupported op 'CPY'
```
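Since, as noted in the question below, the cache works reliably with no KV quantization, the unsupported CPY looks tied to the q4_1 cache type on the Metal backend. For anyone reproducing, this is the same invocation with only the KV quantization flags dropped, i.e. the cache left at the default f16 (paths shortened to placeholders):

```sh
# Same setup as above minus -ctk/-ctv; <main.gguf> and <draft.gguf>
# stand in for the full snapshot paths from the command above.
./build/bin/llama-server -m <main.gguf> -md <draft.gguf> \
  -ngl 99 -ngld 99 -fa --port 8034 --ctx-size 8192 --ctx-size-draft 8192 \
  --draft-min 0 --draft-max 16 -np 7 --host 0.0.0.0 --slots \
  --slot-save-path /Users/mattsinalco/mathias/caching
```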
Other KV-cache quantization types instead crash with a segmentation fault:
```
zsh: segmentation fault  ./build/bin/llama-server -m -md -ngl 99 -ngld 99 -fa --port 8034 --ctx-size
```
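To narrow down which cache types hit the CPY abort versus the segfault, a sweep along these lines should work (a sketch only: llama-cli is used because it exits after generation, and I'm assuming it exercises the same Metal KV-cache path as llama-server; <main.gguf> is a placeholder):

```sh
# Try each KV-cache type in turn and keep the tail of the output;
# any type that aborts on the Metal CPY or segfaults will show it here.
for t in f16 q8_0 q5_1 q5_0 q4_1 q4_0; do
  echo "=== -ctk/-ctv $t ==="
  ./build/bin/llama-cli -m <main.gguf> -ngl 99 -fa -c 8192 \
    -ctk "$t" -ctv "$t" -p "hello" -n 8 2>&1 | tail -n 3
done
```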
Related question: given that the KV cache works reliably in the absence of quantization, can I resize the KV cache? I can't seem to load saved slots of ~200 MB (100 MB works).
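For context, this is how I write and read the slot files (endpoints as documented in the server README; slot id 0 and the filename are just examples):

```sh
# Save slot 0's KV cache into --slot-save-path, then restore it later.
curl -X POST "http://localhost:8034/slots/0?action=save" \
     -H "Content-Type: application/json" -d '{"filename": "slot0.bin"}'
curl -X POST "http://localhost:8034/slots/0?action=restore" \
     -H "Content-Type: application/json" -d '{"filename": "slot0.bin"}'
```

Back-of-envelope, assuming 80 layers, 8 KV heads, and head dim 128 for the 70B: an f16 cache costs about 2 × 80 × 8 × 128 × 2 B ≈ 320 KiB per token, so a 200 MB slot file holds only ~600 tokens, well under the per-slot context of 8192 / 7 ≈ 1170 tokens. With a q4_1 cache (~100 KiB per token) the same 200 MB is ~2000 tokens, which would exceed the per-slot context; that could be why 100 MB loads and 200 MB does not.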
First Bad Commit
No response
Relevant log output
No response