
Adept Persimmon Models not working with CUDA Acceleration #4038

Closed
@maddes8cht

Description

I have successfully gguf-converted the base and chat variants of the Adept Persimmon models.
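
For reference, the conversion used the convert-persimmon-to-gguf.py script from the llama.cpp tree on the Adept release files, followed by quantization. The flag names below are illustrative assumptions, not necessarily the script's exact interface:

```sh
# Flag names are assumptions -- check the script's --help for the real interface.
python convert-persimmon-to-gguf.py \
    --ckpt-path /path/to/8b_chat_model_release \
    --outfile persimmon-8b-chat.f16.gguf

# quantize to Q4_1 (matches the ftype shown in the log below)
./quantize persimmon-8b-chat.f16.gguf persimmon-8b-chat.Q4_1.gguf q4_1
```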

But the resulting .gguf models do not work with CUDA acceleration. I need to set
--n-gpu-layers 0 to get these models working.
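
A minimal repro sketch (the model path is a placeholder for my local file):

```sh
# works: CPU only
./main -m persimmon-8b-chat.Q4_1.gguf --n-gpu-layers 0 -p "Hello"

# aborts with the GGML_ASSERT shown below as soon as layers are offloaded
./main -m persimmon-8b-chat.Q4_1.gguf --n-gpu-layers 39 -p "Hello"
```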

With CUDA layer offloading I get this (after all the llama_model_loader: - tensor .... lines):

llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                   persimmon.context_length u32
llama_model_loader: - kv   3:                 persimmon.embedding_length u32
llama_model_loader: - kv   4:                      persimmon.block_count u32
llama_model_loader: - kv   5:              persimmon.feed_forward_length u32
llama_model_loader: - kv   6:             persimmon.rope.dimension_count u32
llama_model_loader: - kv   7:             persimmon.attention.head_count u32
llama_model_loader: - kv   8:          persimmon.attention.head_count_kv u32
llama_model_loader: - kv   9:                   persimmon.rope.freq_base f32
llama_model_loader: - kv  10:     persimmon.attention.layer_norm_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:               general.quantization_version u32
llama_model_loader: - kv  18:                          general.file_type u32
llama_model_loader: - type  f32:  434 tensors
llama_model_loader: - type q4_1:  145 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 76599/262144 vs 259/262144 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = persimmon
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 262144
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 25000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = mostly Q4_1
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.67 GiB (5.18 BPW)
llm_load_print_meta: general.name   = persimmon-8b-chat
llm_load_print_meta: BOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: EOS token = 71013 '|ENDOFTEXT|'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 71128 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4967.56 MB
llm_load_tensors: offloading 36 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 39/39 layers to GPU
llm_load_tensors: VRAM used: 840.03 MB
............................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 25000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 288.00 MB
llama_new_context_with_model: kv self size  =  288.00 MB
llama_build_graph: non-view tensors processed: 1481/1481
llama_new_context_with_model: compute buffer total size = 7.66 MB
llama_new_context_with_model: VRAM scratch buffer: 1.03 MB
llama_new_context_with_model: total VRAM used: 5456.41 MB (model: 5167.38 MB, context: 289.03 MB)
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7510: src1->backend == GGML_BACKEND_GPU
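
So the CUDA backend aborts because the second source tensor of some op is expected to be GPU-resident but is still on the CPU backend. A minimal sketch of the failing invariant, assuming the ggml API of this vintage (ggml_tensor::backend, GGML_BACKEND_GPU); the wrapper function name is made up:

```cpp
#include "ggml.h"  // struct ggml_tensor, GGML_ASSERT, GGML_BACKEND_GPU/CPU

// Hypothetical stand-in for the op wrapper around ggml-cuda.cu:7510.
// With 39/39 layers offloaded, every source tensor of a CUDA op is
// expected to be GPU-resident already; here one of them is not.
static void ggml_cuda_op_sketch(const struct ggml_tensor * src0,
                                const struct ggml_tensor * src1) {
    GGML_ASSERT(src0->backend == GGML_BACKEND_GPU);
    // This is the check that aborts: for Persimmon, src1 was never
    // offloaded and still reports GGML_BACKEND_CPU.
    GGML_ASSERT(src1->backend == GGML_BACKEND_GPU);
    // ... launch the CUDA kernel on the device-resident data ...
}
```

In other words, the Persimmon graph builder apparently leaves at least one tensor on the CPU backend even when all 39 layers are offloaded.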

I know that the current Persimmon conversion script only operates on the files provided via the link in their GitHub repository, and that this is going to be changed to work with the Hugging Face repos, so this may not be fixed in the current script at all but only in the new one.


Labels: bug (Something isn't working), stale
