Description
Hi, while testing the #6491 branch, I downloaded a Q8_0 quant (split into 3 files) from dranger003 and re-quantized it to Q2_K_S to make it more digestible for my museum hardware:
./quantize --allow-requantize --imatrix ../models/ggml-c4ai-command-r-plus-104b-f16-imatrix.dat ../models/ggml-c4ai-command-r-plus-104b-q8_0-00001-of-00003.gguf ../models/command-r-plus-104b-Q2_K_S.gguf Q2_K_S 2
I only passed the first piece, but ./quantize processed it correctly and produced a single file with the expected size. However, it probably did not update some of the metadata, and ./main still thinks the result is a split file:
./main -m ../models/command-r-plus-104b-Q2_K_S.gguf -t 15 --color -p "this is a test" -c 2048 -ngl 25 -ctk q8_0
...
llama_model_load: error loading model: invalid split file: ../models/command-r-plus-104b-Q2_K_S.gguf
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../models/command-r-plus-104b-Q2_K_S.gguf'
main: error: unable to load model
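In case it helps with debugging, here is a minimal sketch (my own helper, not part of the repo) that dumps the split-related metadata of a GGUF file using the gguf Python package from gguf-py. If I read gguf-split.cpp correctly, the relevant keys are split.no, split.count and split.tensors.count, and I would expect the re-quantized file to still carry the values inherited from the original 3-way split:

# inspect_split_meta.py - hypothetical helper script, not part of llama.cpp.
# Prints the split-related key/value pairs of a GGUF file via gguf-py.
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])
# Key names assumed to match what gguf-split writes.
for key in ("split.no", "split.count", "split.tensors.count"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <not present>")
    else:
        # scalar values are stored as a one-element array in the last part
        print(f"{key}: {field.parts[-1][0]}")

python inspect_split_meta.py ../models/command-r-plus-104b-Q2_K_S.gguf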
As a workaround, it is possible to "reset" the metadata by doing a "dummy pass" of gguf-split:
./gguf-split --split-max-tensors 999 --split ../models/command-r-plus-104b-Q2_K_S.gguf ../models/command-r-plus-104b-Q2_K_S.gguf.split
The resulting file then seems to be working fine.
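Re-running the metadata check sketched above on the gguf-split output should now report split values consistent with the number of files actually produced, which, if my understanding is correct, is why ./main accepts it again.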
It's probably an easy fix, but after a quick grep through the source and a look at quantize.cpp, I realized I don't even know where to start, so it would probably be much easier and faster for someone who knows the codebase to do it.