Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Modified save-load-state example
I applied the following small patch to the save-load-state example:
diff --git a/examples/save-load-state/save-load-state.cpp b/examples/save-load-state/save-load-state.cpp
index 4c86885..6ba8b42 100644
--- a/examples/save-load-state/save-load-state.cpp
+++ b/examples/save-load-state/save-load-state.cpp
@@ -25,6 +25,8 @@ int main(int argc, char ** argv) {
 auto lparams = llama_context_default_params();
+// Added n_gpu_layers and passed -ngl on CLI
+lparams.n_gpu_layers = params.n_gpu_layers;
 lparams.n_ctx = params.n_ctx;
 lparams.seed = params.seed;
 lparams.f16_kv = params.memory_f16;
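For context, the state round trip this patch feeds into is the snapshot/restore pair in the example. Paraphrased (not a verbatim copy of examples/save-load-state/save-load-state.cpp):

```cpp
// Snapshot the full context state (RNG, logits, embeddings, KV cache)
// into a host-side buffer.
const size_t state_size = llama_get_state_size(ctx);
std::vector<uint8_t> state_mem(state_size);
llama_copy_state_data(ctx, state_mem.data());

// ... later, in a second context (ctx2) created from the same model:
// restore the snapshot; generation should continue exactly where the
// first context left off.
llama_set_state_data(ctx2, state_mem.data());
```

With the patch applied, both contexts are created with n_gpu_layers taken from -ngl, so this round trip now runs against partially or fully offloaded state.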
Execution with 32 of 35 layers offloaded to GPU
I then compiled the example and ran it with 32 offloaded layers:
$ make LLAMA_CUBLAS=1 save-load-state
$ ./save-load-state -ngl 32 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
I get the expected output:
$ ./save-load-state -ngl 32 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
main: build = 917 (1a94186)
llama.cpp: loading model from /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 442.73 MB (+ 256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 4007 MB
llama_new_context_with_model: kv self size = 256.00 MB
The quick brown fox jumped over the lazy dog.
The above sentence may very well imply a
llama_new_context_with_model: kv self size = 256.00 MB
jumped over the lazy dog.
The above sentence may very well imply a
Execution with 34 of 35 layers offloaded to GPU
The problem: as soon as I pass -ngl 33 or higher, I get different results between saving and loading the state, and the results get wilder the more I offload to the GPU. Maybe the offloaded V cache has something to do with this? A minimal check for that hypothesis is sketched after the logs below. Here is a run with -ngl 34:
$ ./save-load-state -ngl 34 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
main: build = 917 (1a94186)
llama.cpp: loading model from /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 372.40 MB (+ 256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloaded 34/35 layers to GPU
llama_model_load_internal: total VRAM used: 4205 MB
llama_new_context_with_model: kv self size = 256.00 MB
The quick brown fox jumped over the lazy dog.
The dog did nothing while the sleeping
llama_new_context_with_model: kv self size = 256.00 MB
jump. in which the 21st -century Italian journal ofQuantum
This is what I get when offloading all 35:
$ ./save-load-state -ngl 35 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
...
llama_model_load_internal: mem required = 372.40 MB (+ 256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4333 MB
llama_new_context_with_model: kv self size = 256.00 MB
The quick brown fox jumped over the lazy dog.
The dog did nothing while the story took
llama_new_context_with_model: kv self size = 256.00 MB
jumped over in anticipation of a big market ad trz on interest group
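A way to narrow this down, instead of eyeballing the generated text, would be to compare logits directly after forcing one evaluation through the restored KV cache. A minimal sketch, assuming the llama.cpp C API around this build (function names such as llama_backend_init have moved around between commits, so treat this as a sketch rather than a drop-in program):

```cpp
#include "llama.h"
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char ** argv) {
    // usage: ./kv-check <model.bin> [n_gpu_layers]
    llama_backend_init(false);

    auto lparams = llama_context_default_params();
    lparams.n_gpu_layers = argc > 2 ? atoi(argv[2]) : 0; // compare 32 vs 34

    llama_model *   model = llama_load_model_from_file(argv[1], lparams);
    llama_context * ctx1  = llama_new_context_with_model(model, lparams);

    // Evaluate a short prompt in the first context.
    std::vector<llama_token> tokens(64);
    const int n_tok = llama_tokenize(ctx1, "The quick brown fox", tokens.data(), tokens.size(), true);
    llama_eval(ctx1, tokens.data(), n_tok, 0, 6);

    // Snapshot the state and restore it into a fresh context.
    std::vector<uint8_t> state(llama_get_state_size(ctx1));
    llama_copy_state_data(ctx1, state.data());

    llama_context * ctx2 = llama_new_context_with_model(model, lparams);
    llama_set_state_data(ctx2, state.data());

    // Evaluate one more token in BOTH contexts at the same position. This
    // forces attention over the (possibly offloaded) KV cache; if the cache
    // did not survive the save/restore round trip, the logits diverge here.
    const llama_token next = tokens[0]; // any fixed token works for this test
    llama_eval(ctx1, &next, 1, n_tok, 6);
    llama_eval(ctx2, &next, 1, n_tok, 6);

    const float * l1 = llama_get_logits(ctx1);
    const float * l2 = llama_get_logits(ctx2);
    float max_diff = 0.0f;
    for (int i = 0; i < llama_n_vocab(ctx1); i++) {
        max_diff = std::max(max_diff, std::fabs(l1[i] - l2[i]));
    }
    printf("max logits diff after restore: %g\n", max_diff);

    llama_free(ctx1);
    llama_free(ctx2);
    llama_free_model(model);
    return 0;
}
```

If max_diff stays at 0 with -ngl 32 but becomes non-zero at -ngl 34, that would back the theory that the GPU-resident V (and K) cache is not captured or restored by llama_copy_state_data/llama_set_state_data.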
Environment and Context
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 1600 Six-Core Processor
CPU family: 23
Model: 1
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3200,0000
CPU min MHz: 1550,0000
BogoMIPS: 6387.33
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 384 KiB (6 instances)
L2: 3 MiB (6 instances)
L3: 16 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
- Operating System, e.g. for Linux:
$ uname -a
Linux mockwork 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
weicon@mockwork:~/textgen/llama_fork$ python3 --version
Python 3.10.6
weicon@mockwork:~/textgen/llama_fork$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
weicon@mockwork:~/textgen/llama_fork$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Steps to Reproduce
- Modify the save-load-state example as explained above
- Run a model that has 35 layers with only part of them offloaded, e.g. -ngl 20
- Run the same model with -ngl 35
- Observe the differences before and after restoring the context state
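The last step can be made mechanical with a token-level check. A sketch, using the llama_sample_token_greedy sampler from llama.h of this era (the example itself samples with its saved RNG; greedy is just a deterministic stand-in):

```cpp
#include "llama.h"
#include <vector>

// Pick the highest-logit token from a context. Two contexts sharing the
// same state must agree on this choice.
static llama_token greedy_next(llama_context * ctx) {
    const float * logits  = llama_get_logits(ctx);
    const int     n_vocab = llama_n_vocab(ctx);

    std::vector<llama_token_data> candidates;
    candidates.reserve(n_vocab);
    for (llama_token id = 0; id < n_vocab; id++) {
        candidates.push_back({ id, logits[id], 0.0f });
    }
    llama_token_data_array arr = { candidates.data(), candidates.size(), false };
    return llama_sample_token_greedy(ctx, &arr);
}
```

With -ngl up to 32, greedy_next on the saved and on the restored context returns the same token at every step; from -ngl 33 on they start to disagree.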
llama.cpp$ git log | head -1
commit 1a941869cbef8e9cc351a6c6987e4ae3b0f021f7
Please note: the bug already existed before that commit.
llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy 1.21.5
numpy-stl 2.8.0
$ md5sum /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
2a869ad143efcd003972b0aed196ed0a /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin