Description
Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
Cloning a context state should work.
Current Behavior
Cloning a context state fails.
I'm trying to clone a context state, but it fails on an assert in llama_set_state_data.

It may also be a good idea to provide a dedicated function for cloning a context state, which would make the operation more memory efficient.
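To illustrate the enhancement, a dedicated clone entry point could look roughly like the sketch below. This is purely hypothetical (llama_clone_context is not an existing llama.cpp function); the point is that the library could duplicate the source context's state internally, instead of forcing callers to round-trip everything through a heap buffer:

// Hypothetical API sketch -- llama_clone_context does NOT exist in llama.cpp.
// It would create a new context from the same model and copy src's internal
// state (KV cache, logits, RNG) directly, without an intermediate buffer.
llama_context * llama_clone_context(
        const llama_model   * model,
        const llama_context * src,
        llama_context_params  params);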
Environment and Context
I have a 2021 MacBook Pro with M1 Max, 32 GPU cores, 32GB RAM.
macOS Sonoma 14.0
Make 3.81
cmake version 3.26.5
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.0.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Failure Information (for bugs)
ggml_metal_init: loaded kernel_cpy_f32_f32 0x14ba1d3c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x14ba1d610 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat 0x14ba1d860 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr 0x14ba1dab0 | th_max = 1024 | th_width = 32
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 81.13 MB
llama_new_context_with_model: max tensor size = 312.66 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8967.25 MB, (30939.41 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, (31341.41 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 75.02 MB, (31416.42 / 21845.34), warning: current allocated size is greater than the recommended max working set size
GGML_ASSERT: /Users/user/Documents/workspace/llama.cpp/llama.cpp:9084: ctx->logits.capacity() == logits_cap
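My reading of this assert (an assumption on my part, based only on the message above): the saved state appears to record the source context's logits std::vector capacity, and the restore path requires the destination context's capacity to match exactly. But capacity() is not part of a vector's value; it depends on the growth history and the standard library implementation, so two contexts holding the same data can legitimately differ. A minimal standalone demonstration:

#include <cstdio>
#include <vector>

int main() {
    std::vector<float> a;
    std::vector<float> b;

    a.reserve(1000);                                  // reserved up front
    for (int i = 0; i < 100; i++) a.push_back(0.0f);

    for (int i = 0; i < 100; i++) b.push_back(0.0f);  // grown incrementally

    // Same size and contents, but the capacities differ: capacity() reflects
    // allocation history, not logical state, so asserting on it is fragile.
    printf("a: size=%zu capacity=%zu\n", a.size(), a.capacity());
    printf("b: size=%zu capacity=%zu\n", b.size(), b.capacity());
    return 0;
}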
Steps to Reproduce
// initialize one context and call llama_decode on it
llama_context_params context_params = llama_context_default_params();
llama_context * other_ctx = llama_new_context_with_model(model->model, context_params);

std::vector<llama_token> tokens = llama_tokenize(other_ctx, "Hi there", true);

llama_batch batch = llama_batch_init(tokens.size(), 0);
batch.n_tokens = tokens.size();
int n_cur = 0;
for (int32_t i = 0; i < batch.n_tokens; i++) {
    batch.token[i]  = tokens[i];
    batch.pos[i]    = n_cur;
    batch.seq_id[i] = 0;
    batch.logits[i] = false;
    n_cur++;
}
batch.logits[batch.n_tokens - 1] = true;

int r = llama_decode(other_ctx, batch);

// try to clone the state
const size_t state_size = llama_get_state_size(other_ctx);
uint8_t * state_mem = new uint8_t[state_size];
llama_copy_state_data(other_ctx, state_mem); // it fails here

llama_context * ctx = llama_new_context_with_model(model->model, context_params);
llama_set_state_data(ctx, state_mem);
delete[] state_mem;
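For what it's worth, one speculative and untested workaround (my assumption, not a confirmed fix): put the destination context's logits buffer through the same allocation path before restoring, so its capacity has a chance to match the recorded one. The restore overwrites the KV cache and logits afterwards anyway:

// Speculative workaround sketch (untested): warm up the destination context
// with the same decode before restoring, so its logits vector grows the same
// way and the capacity check can pass.
llama_context * ctx = llama_new_context_with_model(model->model, context_params);
llama_decode(ctx, batch);              // warm-up decode to allocate logits
llama_set_state_data(ctx, state_mem);  // then restore the saved state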
Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use GitHub's markdown to cleanly format your logs for easy readability.
Environment info:
$ git log | head -1
commit 233fc1c69f6f415f35363e18a755f9610e89161b
$ make --version | head -1
GNU Make 3.81
$ md5 ./models/codellama-13b.Q5_K_M.gguf
MD5 (./models/codellama-13b.Q5_K_M.gguf) = c2b04b8d642d0030bc40f380882bd5de
Full run log leading up to the assert:
llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from ./models/codellama-13b.Q5_K_M.gguf (version GGUF V1 (support until nov 2023))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 32016, 1, 1 ]
llama_model_loader: - tensor 1: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 2: output.weight f16 [ 5120, 32016, 1, 1 ]
...
llama_model_loader: - tensor 361: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q4_0: 1 tensors
llama_model_loader: - type q5_K: 240 tensors
llama_model_loader: - type q6_K: 40 tensors
llm_load_print_meta: format = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 8.76 GiB (5.78 BPW)
llm_load_print_meta: general.name = LLaMA
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 8966.75 MB
.................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3200.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14ba08cd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x14ba095c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x14ba09de0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x14ba0a690 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x14ba0aee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x14ba0b690 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x14ba0be40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x14ba0c5f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x14ba0cb20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x14ba0d050 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x14ba0d580 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x14ba0dc30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x14ba0e160 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x14ba0e690 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14ba0ebc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x14ba0f0f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x14ba0f620 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x14ba0fb50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x14ba10080 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x14ba10720 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x14ba10c50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x14ba11180 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x14ba116b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x14ba11d60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f32_f32 0x14ba12290 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32 0x14ba127c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row 0x14ba12d50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4 0x14ba13480 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32 0x14ba139b0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32 0x14ba14150 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32 0x14ba14680 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32 0x14ba14bb0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32 0x14ba14fc0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32 0x14ba153d0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32 0x14ba15900 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32 0x14ba15e30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x14ba16360 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x14ba16890 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x14ba16dc0 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x14ba172f0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x14ba17820 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x14ba17d50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x14ba18280 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x14ba187b0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x14ba18ce0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x14ba19210 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope_f32 0x14ba19740 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rope_f16 0x14ba19e60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x14ba1a390 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x14ba1a8c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x14ba1adf0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x14ba1b320 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat 0x14ba1b850 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr 0x14ba1c000 | th_max = 1024 | th_width = 32
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 364.13 MB
llama_new_context_with_model: max tensor size = 312.66 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8967.25 MB, ( 8967.88 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3202.00 MB, (12169.88 / 21845.34)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 358.02 MB, (12527.89 / 21845.34)
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14961c220 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x14961c470 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x14961c6c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x14961c910 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x14960a3e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x14960a630 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x14961cee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x14961d130 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x14961d380 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x14961d5d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x14961d820 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x14961da70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x14961dcc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x14961df10 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14961e160 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x14961e3b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x14961e600 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x14961e850 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x14961eaa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x14961ecf0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x14961ef40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x149617a30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x149617c80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x149617ed0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f32_f32 0x149618120 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32 0x149618370 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row 0x1496185c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4 0x149618810 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32 0x14961f190 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32 0x14961f3e0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32 0x14961f630 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32 0x14961f880 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32 0x14961fad0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32 0x14961fd20 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32 0x14961ff70 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32 0x1496201c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x149620410 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x149620660 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x1496208b0 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x149620b00 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x149620d50 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x149620fa0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x1496211f0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x149621440 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x149621690 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x1496218e0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope_f32 0x149621b30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rope_f16 0x149621d80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x149621fd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x149622220 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x149622470 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1496226c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat 0x149622910 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr 0x149622b60 | th_max = 1024 | th_width = 32
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 81.13 MB
llama_new_context_with_model: max tensor size = 312.66 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8967.25 MB, (21495.14 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, (21897.14 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 75.02 MB, (21972.16 / 21845.34), warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14b808fb0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x14b809200 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x14b809450 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x14b8096a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x14b804080 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x14b8042d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x14b804520 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x14b804770 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x14b906f90 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x14b906390 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x14b9065e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x14b907570 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x14b9077c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x14b907f60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14b9081b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x14b908400 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x14b908650 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x14b9088a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x14b908af0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x14b908d40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x14b908f90 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x14b9091e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x14b909430 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x14b9044e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f32_f32 0x14b904730 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32 0x14b904980 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row 0x14b904bd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4 0x14b8049c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_0_f32 0x14b804c10 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_1_f32 0x14b804e60 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q8_0_f32 0x14b8050b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q2_K_f32 0x14b805300 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q3_K_f32 0x14b805550 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q4_K_f32 0x14b8057a0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q5_K_f32 0x14b8059f0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mv_q6_K_f32 0x14b805c40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x14b805e90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x14b8060e0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x14b806330 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x14b806580 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x14b809c00 | th_max = 704 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x14b809e50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x14b80a0a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x14b80a2f0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x14b80a540 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x14b80a790 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope_f32 0x14ba1ca80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rope_f16 0x14ba1ccd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x14ba1cf20 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x14ba1d170 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x14ba1d3c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x14ba1d610 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_concat 0x14ba1d860 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_sqr 0x14ba1dab0 | th_max = 1024 | th_width = 32
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 81.13 MB
llama_new_context_with_model: max tensor size = 312.66 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8967.25 MB, (30939.41 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, (31341.41 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 75.02 MB, (31416.42 / 21845.34), warning: current allocated size is greater than the recommended max working set size
GGML_ASSERT: /Users/user/Documents/workspace/llama.cpp/llama.cpp:9084: ctx->logits.capacity() == logits_cap