Description
First I downloaded the meta-llama/Meta-Llama-3-70B-Instruct model from HF. Then I converted it to f16 using the convert.py script from llama.cpp, like this:
python3.12 ~/Software/AI/llama.cpp/convert.py Meta-Llama-3-70B-Instruct/ --outfile Meta-Llama-3-70B-Instruct.f16.gguf --outtype f16 --vocab-type bpe
This worked fine and produced a 108GB file. Unfortunately, I could not load it on my server, which only has 128GB of RAM and an RTX 2080 Ti with 11GB of VRAM, so there was no way to load it either with or without the -ngl option. So I converted the original HF files to Q8_0 instead (again using convert.py), and that could not be loaded either. Then I decided to quantize the f16 .gguf file with the quantize utility from llama.cpp, and this is where the problems started. I naturally started with the highest quality, Q6_K:
$ ./quantize /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf /data/work/Meta-Llama-3-70B-Instruct.Q6_K.gguf Q6_K 12
main: build = 2840 (25c6e82e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf' to '/data/work/Meta-Llama-3-70B-Instruct.Q6_K.gguf' as Q6_K using 12 threads
llama_model_loader: loaded meta data with 21 key-value pairs and 595 tensors from /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 8192
llama_model_loader: - kv 5: llama.block_count u32 = 80
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 64
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - type f32: 132 tensors
llama_model_loader: - type f16: 463 tensors
GGML_ASSERT: llama.cpp:14705: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f314e2ea42f in __GI___wait4 (pid=1819, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f314e2ea42f in __GI___wait4 (pid=1819, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000055761fcd440b in ggml_print_backtrace ()
#2 0x000055761fd73f4a in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3 0x000055761fd74914 in llama_model_quantize ()
#4 0x000055761fcd1911 in main ()
[Inferior 1 (process 1807) detached]
Aborted
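The assertion compares qs.n_attention_wv against model.hparams.n_layer, and the metadata dump above reports llama.block_count = 80, so the quantizer apparently expects one value-projection (attn_v.weight) tensor per layer. To check what the f16 file actually contains, here is a minimal sketch using the gguf Python package from the llama.cpp repo (gguf-py); the blk.N.attn_v.weight naming is an assumption on my part, and I have not attached its output here:

from gguf import GGUFReader  # gguf-py package from the llama.cpp repo (pip install gguf)

# Only the metadata and tensor index are read; the 108GB of tensor data is not loaded.
reader = GGUFReader("/data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf")

# Count the value-projection tensors that the failing assertion is counting.
v_names = [t.name for t in reader.tensors if t.name.endswith("attn_v.weight")]
print("attn_v.weight tensors found:", len(v_names))
print("expected (llama.block_count):", 80)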
Then I tried Q5_K_M (omitting the number of threads, which made no difference):
$ ./quantize /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf /data/work/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf Q5_K_M
main: build = 2840 (25c6e82e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf' to '/data/work/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf' as Q5_K_M
llama_model_loader: loaded meta data with 21 key-value pairs and 595 tensors from /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 8192
llama_model_loader: - kv 5: llama.block_count u32 = 80
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 64
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - type f32: 132 tensors
llama_model_loader: - type f16: 463 tensors
GGML_ASSERT: llama.cpp:14705: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f17e16ea42f in __GI___wait4 (pid=1916, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f17e16ea42f in __GI___wait4 (pid=1916, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000559cc1b2740b in ggml_print_backtrace ()
#2 0x0000559cc1bc6f4a in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3 0x0000559cc1bc7914 in llama_model_quantize ()
#4 0x0000559cc1b24911 in main ()
[Inferior 1 (process 1904) detached]
Aborted
I tried a few more types after that, and they all failed in the same way:
$ ./quantize /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf /data/work/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf Q4_K_M
main: build = 2840 (25c6e82e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf' to '/data/work/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 21 key-value pairs and 595 tensors from /data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 8192
llama_model_loader: - kv 5: llama.block_count u32 = 80
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 64
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - type f32: 132 tensors
llama_model_loader: - type f16: 463 tensors
GGML_ASSERT: llama.cpp:14705: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f7d046ea42f in __GI___wait4 (pid=1959, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f7d046ea42f in __GI___wait4 (pid=1959, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00005626045b840b in ggml_print_backtrace ()
#2 0x0000562604657f4a in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3 0x0000562604658914 in llama_model_quantize ()
#4 0x00005626045b5911 in main ()
[Inferior 1 (process 1947) detached]
Aborted
The version of llama.cpp is very recent -- cloned yesterday evening.
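In case it helps with triage, here is a follow-up sketch under the same gguf-py assumption as above; it lists which block indices, if any, are missing an attn_v.weight tensor in the f16 file:

import re
from gguf import GGUFReader  # gguf-py package from the llama.cpp repo

reader = GGUFReader("/data/Llama/Meta-Llama-3-70B-Instruct.f16.gguf")

# Collect the layer index of every value-projection tensor present in the file.
present = set()
for t in reader.tensors:
    m = re.match(r"blk\.(\d+)\.attn_v\.weight$", t.name)
    if m:
        present.add(int(m.group(1)))

# llama.block_count = 80 according to the metadata dump above.
missing = sorted(set(range(80)) - present)
print("blocks missing attn_v.weight:", missing or "none")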