This issue was already mentioned in #3436; I am creating a separate issue so that it does not get lost.
I ran LLaVA (commit 1e0e873) with:
```
./llava -m ggml-model-q5_k.gguf \
    --mmproj mmproj-model-f16.gguf \
    --temp 0.1 -ngl 64 -mg 0 \
    --image n008-2018-09-18-14-54-39-0400__CAM_FRONT__1537297366762404.jpg
```
These are the relevant parts of the output:
```
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MB
llama_new_context_with_model: total VRAM used: 156.00 MB (model: 0.00 MB, context: 156.00 MB)
...
main: image encoded in 1561.49 ms by CLIP ( 2.71 ms per image patch)
llama_print_timings: load time = 3042.21 ms
llama_print_timings: sample time = 11.65 ms / 136 runs ( 0.09 ms per token, 11671.82 tokens per second)
llama_print_timings: prompt eval time = 9440.69 ms / 626 tokens ( 15.08 ms per token, 66.31 tokens per second)
llama_print_timings: eval time = 47661.78 ms / 136 runs ( 350.45 ms per token, 2.85 tokens per second)
llama_print_timings: total time = 58800.36 ms
```

Note that even though `-ngl 64` was passed, no layers were offloaded (`offloaded 0/35 layers to GPU`, `VRAM used: 0.00 MB`), so generation runs on the CPU at ~2.85 tokens per second.
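My guess (I have not checked the llava code to confirm) is that the example never copies the parsed `-ngl` value into the `llama_model_params` it hands to the loader. As a minimal sketch of how the flag normally reaches the loader, assuming the standard `gpt_params` / `llama_model_params` flow from `common.h` and `llama.h` (the function name `load_with_offload` is hypothetical):

```cpp
// Sketch only, not the actual llava code: how -ngl normally reaches the
// model loader. If an example builds its llama_model_params from the
// defaults without copying params.n_gpu_layers, the flag is silently
// ignored and 0/35 layers get offloaded, exactly as in the log above.
#include "common.h"
#include "llama.h"

static llama_model * load_with_offload(const gpt_params & params) {
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = params.n_gpu_layers; // propagate -ngl
    model_params.main_gpu     = params.main_gpu;     // propagate -mg
    return llama_load_model_from_file(params.model.c_str(), model_params);
}
```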