This issue was already mentioned in #3436; I am creating a separate issue so that it does not get lost.
I ran LLaVA (commit 1e0e873) with:
```
./llava -m ggml-model-q5_k.gguf \
    --mmproj mmproj-model-f16.gguf \
    --temp 0.1 -ngl 64 -mg 0 \
    --image n008-2018-09-18-14-54-39-0400__CAM_FRONT__1537297366762404.jpg
```
These are the relevant parts of the output:
```
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
...
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MB
llama_new_context_with_model: total VRAM used: 156.00 MB (model: 0.00 MB, context: 156.00 MB)
...
main: image encoded in 1561.49 ms by CLIP ( 2.71 ms per image patch)
llama_print_timings: load time = 3042.21 ms
llama_print_timings: sample time = 11.65 ms / 136 runs ( 0.09 ms per token, 11671.82 tokens per second)
llama_print_timings: prompt eval time = 9440.69 ms / 626 tokens ( 15.08 ms per token, 66.31 tokens per second)
llama_print_timings: eval time = 47661.78 ms / 136 runs ( 350.45 ms per token, 2.85 tokens per second)
llama_print_timings: total time = 58800.36 ms
```

Note that even though `-ngl 64` was passed, no layers were offloaded (`offloaded 0/35 layers to GPU`, `VRAM used: 0.00 MB`), so generation runs on the CPU at ~2.85 tokens per second.
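My guess (I have not checked the llava code to confirm) is that the example never copies the parsed `-ngl` value into the `llama_model_params` it hands to the loader. As a minimal sketch of how the flag normally reaches the loader, assuming the standard `gpt_params` / `llama_model_params` flow from `common.h` and `llama.h` (the function name `load_with_offload` is hypothetical):

```cpp
// Sketch only, not the actual llava code: how -ngl normally reaches the
// model loader. If an example builds its llama_model_params from the
// defaults without copying params.n_gpu_layers, the flag is silently
// ignored and 0/35 layers get offloaded, exactly as in the log above.
#include "common.h"
#include "llama.h"

static llama_model * load_with_offload(const gpt_params & params) {
    llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = params.n_gpu_layers; // propagate -ngl
    model_params.main_gpu     = params.main_gpu;     // propagate -mg
    return llama_load_model_from_file(params.model.c_str(), model_params);
}
```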