Description
Current Behavior
I got this crash on https://github.com/cebtenzzre/llama.cpp/tree/18fe116e9a5aa45a83bd1d6f043f98dc395f218e:
2023-11-26 20:06:04 INFO:Loaded the model in 9.14 seconds.
GGML_ASSERT: /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5484: false
Failure Information (for bugs)
Backtrace:
#3 0x00007f5999fd54b8 in __GI_abort () at abort.c:79
#4 0x00007f585ac6b357 in ggml_mul_mat_q4_0_q8_1_cuda (stream=<optimized out>, nrows_dst=<optimized out>, nrows_y=<optimized out>, ncols_y=<optimized out>,
nrows_x=<optimized out>, ncols_x=<optimized out>, dst=<optimized out>, vy=<optimized out>, vx=<optimized out>)
at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5076
#5 ggml_cuda_op_mul_mat_q (src0=src0@entry=0x204c00320, src1=src1@entry=0x269123d80, dst=dst@entry=0x269123f00, src0_dd_i=src0_dd_i@entry=0x90be00000 "",
src1_ddf_i=src1_ddf_i@entry=0x9b0400000, src1_ddq_i=src1_ddq_i@entry=0x9afe00000 "", dst_dd_i=0x90b420400, row_low=32000, row_high=32032, src1_ncols=512,
src1_padded_row_size=5120, stream=@0x7f5878be7fa8: 0x7f5861b127a0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6098
#6 0x00007f585ac641f2 in ggml_cuda_op_mul_mat (src0=0x204c00320, src1=<optimized out>, dst=<optimized out>,
op=0x7f585ac6b270 <ggml_cuda_op_mul_mat_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st* const&)>, convert_src1_to_q8_1=true) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6959
#7 0x00007f585ac66023 in ggml_cuda_compute_forward (params=params@entry=0x7f5878be8560, tensor=tensor@entry=0x269123f00)
at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:7844
#8 0x00007f585ac4606e in ggml_compute_forward (tensor=0x269123f00, params=0x7f5878be8560) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:14503
#9 ggml_graph_compute_thread (data=data@entry=0x7f5878be85e0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16245
#10 0x00007f585ac4862e in ggml_graph_compute (cgraph=0x269000020, cplan=<optimized out>) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16831
#11 0x00007f585ac794b3 in ggml_graph_compute_helper (buf=std::vector of length 0, capacity 0, graph=graph@entry=0x269000020, n_threads=n_threads@entry=1)
at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:592
#12 0x00007f585ac7c365 in llama_decode_internal (lctx=..., batch=...) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:5194
#13 0x00007f585ac7cac8 in llama_eval (ctx=0x7f586234bff0, tokens=0x7f5862346200, n_tokens=512, n_past=0)
at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:8842
#14 0x00007f5998def4f6 in ffi_call_unix64 () at ../src/x86/unix64.S:104
Relevant code: https://github.com/cebtenzzre/llama.cpp/blob/18fe116e9a5aa45a83bd1d6f043f98dc395f218e/ggml-cuda.cu#L5054-L5077
The code asserts that g_compute_capabilities[id] >= MIN_CC_DP4A (610), where id is the current device. But here g_compute_capabilities[id] is 520, which matches my GTX 970 (compute capability 5.2):
>>> print id
$10 = 1
>>> print g_compute_capabilities[0]
$11 = 610
>>> print g_compute_capabilities[1]
$12 = 520
Steps to Reproduce
I'm not exactly sure how I ran into this issue, because I've been using the same build for weeks without seeing it. It could be an issue with my fork; I should investigate whether the latest upstream llama.cpp is still significantly slower on my GPUs. I still have the coredump handy if any further information would help.
cc @slaren