
Adept Persimmon Models not working with CUDA Acceleration #4038

Closed
@maddes8cht

Description

I have successfully gguf-converted the base and chat variants of the Adept Persimmon models.
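
For reference, the conversion used the convert-persimmon-to-gguf.py script from the llama.cpp tree on the Adept release files, followed by quantization. The flag names below are illustrative assumptions, not necessarily the script's exact interface:

```sh
# Flag names are assumptions -- check the script's --help for the real interface.
python convert-persimmon-to-gguf.py \
    --ckpt-path /path/to/8b_chat_model_release \
    --outfile persimmon-8b-chat.f16.gguf

# quantize to Q4_1 (matches the ftype shown in the log below)
./quantize persimmon-8b-chat.f16.gguf persimmon-8b-chat.Q4_1.gguf q4_1
```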

But the resulting .gguf models do not work with CUDA acceleration. I need to set
--n-gpu-layers 0 to get these models working.
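
A minimal repro sketch (the model path is a placeholder for my local file):

```sh
# works: CPU only
./main -m persimmon-8b-chat.Q4_1.gguf --n-gpu-layers 0 -p "Hello"

# aborts with the GGML_ASSERT shown below as soon as layers are offloaded
./main -m persimmon-8b-chat.Q4_1.gguf --n-gpu-layers 39 -p "Hello"
```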

With CUDA layer offloading I get this (after all the llama_model_loader: - tensor .... lines):

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                   persimmon.context_length u32
llama_model_loader: - kv   3:                 persimmon.embedding_length u32
llama_model_loader: - kv   4:                      persimmon.block_count u32
llama_model_loader: - kv   5:              persimmon.feed_forward_length u32
llama_model_loader: - kv   6:             persimmon.rope.dimension_count u32
llama_model_loader: - kv   7:             persimmon.attention.head_count u32
llama_model_loader: - kv   8:          persimmon.attention.head_count_kv u32
llama_model_loader: - kv   9:                   persimmon.rope.freq_base f32
llama_model_loader: - kv  10:     persimmon.attention.layer_norm_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:               general.quantization_version u32
llama_model_loader: - kv  18:                          general.file_type u32
llama_model_loader: - type  f32:  434 tensors
llama_model_loader: - type q4_1:  145 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 76599/262144 vs 259/262144 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = persimmon
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 262144
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 25000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = mostly Q4_1
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.67 GiB (5.18 BPW)
llm_load_print_meta: general.name   = persimmon-8b-chat
llm_load_print_meta: BOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: EOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 71128 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4967.56 MB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 39/39 layers to GPU
llm_load_tensors: VRAM used: 840.03 MB
............................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 25000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 288.00 MB
llama_new_context_with_model: kv self size  =  288.00 MB
llama_build_graph: non-view tensors processed: 1481/1481
llama_new_context_with_model: compute buffer total size = 7.66 MB
llama_new_context_with_model: VRAM scratch buffer: 1.03 MB
llama_new_context_with_model: total VRAM used: 5456.41 MB (model: 5167.38 MB, context: 289.03 MB)
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7510: src1->backend == GGML_BACKEND_GPU
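
So the CUDA backend aborts because the second source tensor of some op is expected to be GPU-resident but is still on the CPU backend. A minimal sketch of the failing invariant, assuming the ggml API of this vintage (ggml_tensor::backend, GGML_BACKEND_GPU); the wrapper function name is made up:

```cpp
#include "ggml.h"  // struct ggml_tensor, GGML_ASSERT, GGML_BACKEND_GPU/CPU

// Hypothetical stand-in for the op wrapper around ggml-cuda.cu:7510.
// With 39/39 layers offloaded, every source tensor of a CUDA op is
// expected to be GPU-resident already; here one of them is not.
static void ggml_cuda_op_sketch(const struct ggml_tensor * src0,
                                const struct ggml_tensor * src1) {
    GGML_ASSERT(src0->backend == GGML_BACKEND_GPU);
    // This is the check that aborts: for Persimmon, src1 was never
    // offloaded and still reports GGML_BACKEND_CPU.
    GGML_ASSERT(src1->backend == GGML_BACKEND_GPU);
    // ... launch the CUDA kernel on the device-resident data ...
}
```

In other words, the Persimmon graph builder apparently leaves at least one tensor on the CPU backend even when all 39 layers are offloaded.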

I know that the current Persimmon conversion script only operates on the files provided via the link in their GitHub repository, and that this is going to be changed to work with the Hugging Face repos, so this may not be fixed in the current script at all but only in the new one.


Labels: bug (Something isn't working), stale
