
[Bug] llama_set_state_data() does not work correctly with offloaded GPU Layers (kv cache) #2422

Closed
@WeirdConstructor

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Modified save-load-state example

I applied the following little patch to save-load-state:

diff --git a/examples/save-load-state/save-load-state.cpp b/examples/save-load-state/save-load-state.cpp
index 4c86885..6ba8b42 100644
--- a/examples/save-load-state/save-load-state.cpp
+++ b/examples/save-load-state/save-load-state.cpp
@@ -25,6 +25,8 @@ int main(int argc, char ** argv) {
 
     auto lparams = llama_context_default_params();
 
+    // Added n_gpu_layers and passed -ngl on CLI
+    lparams.n_gpu_layers = params.n_gpu_layers;
     lparams.n_ctx     = params.n_ctx;
     lparams.seed      = params.seed;
     lparams.f16_kv    = params.memory_f16;

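In condensed form, the round trip the example performs looks roughly like this (a sketch, not the exact example code; API signatures as of this build):

#include "llama.h"
#include <cstdint>
#include <vector>

// Condensed sketch of the save/restore round trip in save-load-state.
void round_trip(llama_context * ctx, llama_context * ctx2,
                const std::vector<llama_token> & prompt, int n_threads) {
    // evaluate the prompt on the first context (this fills its kv cache)
    llama_eval(ctx, prompt.data(), (int) prompt.size(), /*n_past=*/0, n_threads);

    // snapshot the full state: rng, logits, embedding and kv cache
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());

    // restore into a second context; sampling from ctx2 should then
    // continue exactly where ctx left off
    llama_set_state_data(ctx2, state.data());
}

With -ngl 32 the continuation from the restored context matches, as shown next; with higher values it does not.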
Execution with 32 of 35 offloaded GPU layers

Then I compiled it and ran it with 32 offloaded layers:

$ make LLAMA_CUBLAS=1 save-load-state 
$ ./save-load-state -ngl 32 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin

I get the expected output:

$ ./save-load-state -ngl 32  -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin 
main: build = 917 (1a94186)
llama.cpp: loading model from /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  442.73 MB (+  256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 4007 MB
llama_new_context_with_model: kv self size  =  256.00 MB

The quick brown fox jumped over the lazy dog.
The above sentence may very well imply a

llama_new_context_with_model: kv self size  =  256.00 MB
 jumped over the lazy dog.
The above sentence may very well imply a

Execution with 34 of 35 offloaded GPU layers

The problem: as soon as I pass -ngl 33 or more, I get different results between saving and loading the state, and the results get wilder the more I offload to the GPU.

Maybe the offloaded v cache has something to do with this? Here is the -ngl 34 run:

$ ./save-load-state -ngl 34  -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin 
main: build = 917 (1a94186)
llama.cpp: loading model from /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  372.40 MB (+  256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloaded 34/35 layers to GPU
llama_model_load_internal: total VRAM used: 4205 MB
llama_new_context_with_model: kv self size  =  256.00 MB

The quick brown fox jumped over the lazy dog.
The dog did nothing while the sleeping

llama_new_context_with_model: kv self size  =  256.00 MB
 jump. in which the 21st -century Italian journal ofQuantum

This is what I get when offloading all 35 layers:

$ ./save-load-state -ngl 35  -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin 
...
llama_model_load_internal: mem required  =  372.40 MB (+  256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4333 MB
llama_new_context_with_model: kv self size  =  256.00 MB

The quick brown fox jumped over the lazy dog.
The dog did nothing while the story took

llama_new_context_with_model: kv self size  =  256.00 MB
 jumped over in anticipation of a big market ad trz on interest group
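
My (speculative) reading of why this tracks the v/k cache offload: llama_copy_state_data() / llama_set_state_data() presumably copy the kv_self tensors byte for byte on the host, which can only work while their backing data lives in host memory. The sketch below is purely illustrative; kv_self, the backend check, and the copy logic are my assumptions about the internals, not code taken from llama.cpp:

// Hypothetical illustration of the suspected failure mode, NOT actual
// llama.cpp code.
#include "ggml.h"
#include <cstdint>
#include <cstring>

static void copy_v_cache(uint8_t * dst, const struct ggml_tensor * v) {
    // fine while the tensor is a plain CPU tensor ...
    memcpy(dst, v->data, ggml_nbytes(v));
    // ... but if v->backend == GGML_BACKEND_GPU (as after "offloading
    // v cache to GPU"), ->data no longer points at the live contents,
    // so both the snapshot and the later restore would operate on
    // garbage, which would explain the diverging generations above.
}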

Environment and Context

lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 1600 Six-Core Processor
    CPU family:          23
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         3200,0000
    CPU min MHz:         1550,0000
    BogoMIPS:            6387.33
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht sysc
                         all nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmp
                         erf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_l
                         m cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfc
                         tr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 
                         rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv s
                         vm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vm
                         load vgif overflow_recov succor smca sev
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   384 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    16 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT vulnerable
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

$ uname -a
Linux mockwork 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
weicon@mockwork:~/textgen/llama_fork$ python3 --version
Python 3.10.6

weicon@mockwork:~/textgen/llama_fork$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

weicon@mockwork:~/textgen/llama_fork$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Steps to Reproduce

  1. Modify the save-load-state example as explained above
  2. Execute a model that has 35 layers with e.g. -ngl 20
  3. Execute the same model with -ngl 35
  4. Observe the differences before and after restoring the context state (condensed commands below)
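
On my machine the whole repro condenses to these commands (model path as above; the two generations printed by the first run match, while with -ngl 35 they diverge):

$ make LLAMA_CUBLAS=1 save-load-state
$ ./save-load-state -ngl 20 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
$ ./save-load-state -ngl 35 -m /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin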
llama.cpp$ git log | head -1
commit 1a941869cbef8e9cc351a6c6987e4ae3b0f021f7

Please note: the bug already existed before that commit.

llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy                    1.21.5
numpy-stl                2.8.0

$ md5sum /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
2a869ad143efcd003972b0aed196ed0a  /mnt/old/root/new_data/guanaco-7B.ggmlv3.q4_K_M_by_TheBloke_20230525.bin
