Bug: Occasional crashes when a connection has been interrupted before completion of computation #9928

Closed
@sliedes

Description

What happened?

I am running llama-server like this:

llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10

When I make a number of /completion calls and then close those connections without waiting for a response (e.g. by terminating the connecting process), llama-server often crashes with:

/build/source/ggml/src/ggml-cuda.cu:70: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /build/source/ggml/src/ggml-cuda.cu:2446
  cudaStreamSynchronize(cuda_ctx->stream())

I've been trying to build it with -DCMAKE_BUILD_TYPE=Debug, but for some reason gdb still shows "<optimized out>" for several variables, and I don't know why. Either I or Nix may be doing something fishy. The binary does come with debug info, so I believe it is the debug build, although the optimized-out variables suggest optimization was still enabled somewhere.
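
For anyone trying to reproduce this, the trigger described above (several /completion requests whose connections are dropped before any response is read) can be approximated with a short script. This is only a sketch of what my client side roughly does, not the exact client: the helper names `build_completion_request` and `fire_and_abandon`, the prompt, and the `n_predict` value are made up for illustration, and closing a socket is only an approximation of killing the client process.

```python
import json
import socket


def build_completion_request(prompt: str, n_predict: int, host: str) -> bytes:
    """Serialize a minimal HTTP/1.1 POST to /completion as raw bytes."""
    # "prompt" and "n_predict" are the JSON fields the /completion endpoint
    # accepts; the values passed in are illustrative placeholders.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    head = (
        "POST /completion HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def fire_and_abandon(host: str, port: int, count: int,
                     prompt: str = "Hello", n_predict: int = 512) -> int:
    """Send `count` requests, closing each socket without reading a reply.

    Returns the number of requests that were fully written to the socket.
    """
    payload = build_completion_request(prompt, n_predict, host)
    sent = 0
    for _ in range(count):
        # Leaving the with-block closes the socket while the server is
        # presumably still computing the completion.
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(payload)
            sent += 1
    return sent
```

Assuming the server listens on 127.0.0.1:8080 (an assumption; adjust to your setup), `fire_and_abandon("127.0.0.1", 8080, count=10)` sends one abandoned request per slot of `--parallel 10`. The crash is occasional, so it may take several runs to hit it.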

GDB output:

$ coredumpctl debug
           PID: 1067626 (llama-server)
           UID: 1000 (sliedes)
           GID: 100 (users)
        Signal: 6 (ABRT)
     Timestamp: Thu 2024-10-17 16:54:42 CEST (5min ago)
  Command Line: llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10
    Executable: /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server
 Control Group: /user.slice/user-1000.slice/session-12.scope
          Unit: session-12.scope
         Slice: user-1000.slice
       Session: 12
     Owner UID: 1000 (sliedes)
       Boot ID: 6224b3f52c0e45468c99f5f5cc1d17f4
    Machine ID: 13629c48106c49a39ea48f0b10557f82
      Hostname: poyta
       Storage: /var/lib/systemd/coredump/core.llama-server.1000.6224b3f52c0e45468c99f5f5cc1d17f4.1067626.1729176882000000.zst (present)
  Size on Disk: 234.9M
       Message: Process 1067626 (llama-server) of user 1000 dumped core.

                Module libgomp.so.1 without build-id.
                Module libgcc_s.so.1 without build-id.
                Module libstdc++.so.6 without build-id.
                Stack trace of thread 1067626:
                #0  0x00007ffff329b7dc __pthread_kill_implementation (libc.so.6 + 0x927dc)
                #1  0x00007ffff3249516 raise (libc.so.6 + 0x40516)
                #2  0x00007ffff3231935 abort (libc.so.6 + 0x28935)
                #3  0x00007ffff381c7c5 ggml_abort.cold (libggml.so + 0x1c7c5)
                #4  0x00007ffff38ea863 _Z15ggml_cuda_errorPKcS0_S0_iS0_ (libggml.so + 0xea863)
                #5  0x00007ffff38eb80a _ZL29ggml_backend_cuda_synchronizeP12ggml_backend (libggml.so + 0xeb80a)
                #6  0x00007ffff38759e6 ggml_backend_sched_synchronize (libggml.so + 0x759e6)
                #7  0x00007ffff3877873 ggml_backend_sched_reserve (libggml.so + 0x77873)
                #8  0x00007ffff7e90076 _ZL30llama_kv_cache_update_internalR13llama_context (libllama.so + 0x70076)
                #9  0x00007ffff7e96c53 llama_decode (libllama.so + 0x76c53)
                #10 0x000000000049fd82 _ZN14server_context12update_slotsEv (llama-server + 0xa0d82)
                #11 0x0000000000487e99 _ZN12server_queue10start_loopEv (llama-server + 0x88e99)
                #12 0x000000000042644e main (llama-server + 0x2744e)
                #13 0x00007ffff323314e __libc_start_call_main (libc.so.6 + 0x2a14e)
                #14 0x00007ffff3233209 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a209)
                #15 0x0000000000428095 _start (llama-server + 0x29095)

                Stack trace of thread 1067627:
                #0  0x00007ffff330ad1f __poll (libc.so.6 + 0x101d1f)
                #1  0x00007fffcc254e3f n/a (libcuda.so.1 + 0x254e3f)
                #2  0x00007fffcc327fbf n/a (libcuda.so.1 + 0x327fbf)
                #3  0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067637:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067638:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067636:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067639:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067631:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x00000000004f662b _ZZN10common_log6resumeEvENKUlvE_clEv (llama-server + 0xf762b)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067643:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067646:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067644:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067647:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067645:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067649:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067635:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
                #3  0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
                #4  0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
                #5  0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
                #6  0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
                #7  0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
                #8  0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
                #9  0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
                #10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
                #11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067648:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067654:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067640:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
                #3  0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
                #4  0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
                #5  0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
                #6  0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
                #7  0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
                #8  0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
                #9  0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
                #10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
                #11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067633:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067655:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067641:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
                #3  0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
                #4  0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
                #5  0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
                #6  0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
                #7  0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
                #8  0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
                #9  0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
                #10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
                #11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067651:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067660:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067642:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067634:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067632:
                #0  0x00007ffff331b01f accept (libc.so.6 + 0x11201f)
                #1  0x000000000042d0e4 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ4mainEUlvE1_EEEEE6_M_runEv (llama-server + 0x2e0e4)
                #2  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #3  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #4  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067650:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067665:
                #0  0x00007ffff330ad1f __poll (libc.so.6 + 0x101d1f)
                #1  0x00007fffcc254e3f n/a (libcuda.so.1 + 0x254e3f)
                #2  0x00007fffcc327fbf n/a (libcuda.so.1 + 0x327fbf)
                #3  0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067657:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067664:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298f45 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8ff45)
                #2  0x00007fffcc1aebca n/a (libcuda.so.1 + 0x1aebca)
                #3  0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067658:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067661:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067652:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067653:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067656:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067662:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067659:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)

                Stack trace of thread 1067663:
                #0  0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
                #1  0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
                #2  0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
                #3  0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
                #4  0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
                #5  0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
                ELF object binary architecture: AMD x86-64

Reading symbols from /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server...

warning: Loadable section ".dynstr" outside of ELF segments
  in /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server
Reading symbols from /nix/store/jvyl2rg6mff5c6z3477sbip03w86rwjw-llama-cpp-3933-debug/lib/debug/.build-id/db/86f367231952a378ab1268136a29fb91d5a98b.debug...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
[New LWP 1067626]
<... snipped text...>
[New LWP 1067663]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1".
Core was generated by `llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0)
    at pthread_kill.c:44
44	      return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7ffff3787000 (LWP 1067626))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0)
    at pthread_kill.c:44
#1  0x00007ffff329b843 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff3249516 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff3231935 in __GI_abort () at abort.c:79
#4  0x00007ffff381c7c5 in ggml_abort (file=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=70,
    fmt=0x7ffff3a32d51 "CUDA error") at /build/source/ggml/src/ggml.c:305
#5  0x00007ffff38ea863 in ggml_cuda_error (stmt=stmt@entry=0x7ffff3a3f1c0 "cudaStreamSynchronize(cuda_ctx->stream())",
    func=func@entry=0x7ffff3a32e36 "ggml_backend_cuda_synchronize",
    file=file@entry=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=line@entry=2446,
    msg=0x7ffff2e8db00 "an illegal memory access was encountered") at /build/source/ggml/src/ggml-cuda.cu:70
#6  0x00007ffff38eb80a in ggml_backend_cuda_synchronize (backend=<optimized out>)
    at /build/source/ggml/src/ggml-cuda.cu:2446
#7  0x00007ffff38759e6 in ggml_backend_sched_synchronize (sched=sched@entry=0x127e630)
    at /build/source/ggml/src/ggml-backend.cpp:2349
#8  0x00007ffff3877873 in ggml_backend_sched_reserve (sched=0x127e630, measure_graph=<optimized out>)
    at /build/source/ggml/src/ggml-backend.cpp:2307
#9  0x00007ffff7e90076 in llama_kv_cache_update_internal (lctx=...) at /build/source/src/llama.cpp:17891
#10 0x00007ffff7e90c25 in llama_kv_cache_update (ctx=<optimized out>) at /build/source/src/llama.cpp:20123
#11 0x00007ffff7e96c53 in llama_decode_internal (batch_all=..., lctx=...) at /build/source/src/llama.cpp:17248
#12 llama_decode (ctx=0x1269150, batch=...) at /build/source/src/llama.cpp:21200
#13 0x000000000049fd82 in server_context::update_slots (this=<optimized out>)
    at /build/source/examples/server/server.cpp:2292
#14 0x0000000000487e99 in std::function<void()>::operator() (this=0x7fffffffb1a8)
    at /nix/store/6mmwy4jcnqnhms3i56r1hbdn656akg1d-gcc-13.3.0/include/c++/13.3.0/bits/std_function.h:591
#15 server_queue::start_loop (this=this@entry=0x7fffffffb088) at /build/source/examples/server/server.cpp:504
#16 0x000000000042644e in main (argc=<optimized out>, argv=<optimized out>)
    at /build/source/examples/server/server.cpp:3402
(gdb) set substitute-path /build/source /home/sliedes/proj/llama.cpp
(gdb) fra 4
#4  0x00007ffff381c7c5 in ggml_abort (file=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=70,
    fmt=0x7ffff3a32d51 "CUDA error") at /build/source/ggml/src/ggml.c:305
305	    abort();
(gdb) q

Name and Version

The actual build is b3933 (f010b77) on NixOS; the build scripts report the version as 0:


ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4090)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen Threadripper PRO 5955WX 16-Cores)
version: 0 (unknown)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

$ llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4090)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen Threadripper PRO 5955WX 16-Cores)
build: 0 (unknown) with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu (debug)
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 31
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 18924 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Replete LLM V2.5 Qwen 14b
llama_model_loader: - kv   3:                           general.basename str              = Replete-LLM-V2.5-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Qwen2.5 14B Instruct
llama_model_loader: - kv   8:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   9:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-1...
llama_model_loader: - kv  10:                          qwen2.block_count u32              = 48
llama_model_loader: - kv  11:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  12:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  13:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv  14:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  15:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  16:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  17:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                          general.file_type u32              = 27
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                      quantize.imatrix.file str              = /models_out/Replete-LLM-V2.5-Qwen-14b...
llama_model_loader: - kv  31:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  32:             quantize.imatrix.entries_count i32              = 336
llama_model_loader: - kv  33:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  102 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  235 tensors
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 14.77 B
llm_load_print_meta: model size       = 6.44 GiB (3.74 BPW)
llm_load_print_meta: general.name     = Replete LLM V2.5 Qwen 14b
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.51 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:        CPU buffer size =   319.04 MiB
llm_load_tensors:      CUDA0 buffer size =  6271.39 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 102400
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size = 10200.00 MiB
llama_new_context_with_model: KV self size  = 10200.00 MiB, K (q8_0): 5100.00 MiB, V (q8_0): 5100.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     6.38 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   340.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   210.01 MiB
llama_new_context_with_model: graph nodes  = 1495
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 10
slot         init: id  0 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  1 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  2 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  3 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  4 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  5 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  6 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  7 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  8 | task -1 | new slot n_ctx_slot = 10240
slot         init: id  9 | task -1 | new slot n_ctx_slot = 10240
main: model loaded
main: chat template, built_in: 0, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
request: GET /props 127.0.0.1 200
request: POST /tokenize 127.0.0.1 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 0 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8594
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.238306
slot launch_slot_: id  1 | task 2 | processing task
slot launch_slot_: id  2 | task 3 | processing task
slot launch_slot_: id  3 | task 4 | processing task
slot launch_slot_: id  4 | task 5 | processing task
slot launch_slot_: id  5 | task 6 | processing task
slot launch_slot_: id  6 | task 7 | processing task
slot launch_slot_: id  7 | task 8 | processing task
slot launch_slot_: id  8 | task 9 | processing task
slot launch_slot_: id  9 | task 10 | processing task
slot update_slots: id  0 | task 0 | kv cache rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.476612
slot update_slots: id  0 | task 0 | kv cache rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.714917
slot update_slots: id  0 | task 0 | kv cache rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.953223
slot update_slots: id  0 | task 0 | kv cache rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8594, n_tokens = 402, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 8594, n_tokens = 402
slot update_slots: id  1 | task 2 | tokenizing prompt, len = 1
slot update_slots: id  1 | task 2 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8585
slot update_slots: id  1 | task 2 | kv cache rm [0, end)
slot update_slots: id  1 | task 2 | prompt processing progress, n_past = 1646, n_tokens = 2048, progress = 0.191730
slot update_slots: id  1 | task 2 | kv cache rm [1646, end)
slot update_slots: id  1 | task 2 | prompt processing progress, n_past = 3693, n_tokens = 2048, progress = 0.430169
slot update_slots: id  1 | task 2 | kv cache rm [3693, end)
slot update_slots: id  1 | task 2 | prompt processing progress, n_past = 5740, n_tokens = 2048, progress = 0.668608
slot update_slots: id  1 | task 2 | kv cache rm [5740, end)
slot update_slots: id  1 | task 2 | prompt processing progress, n_past = 7787, n_tokens = 2048, progress = 0.907047
slot update_slots: id  1 | task 2 | kv cache rm [7787, end)
slot update_slots: id  1 | task 2 | prompt processing progress, n_past = 8585, n_tokens = 799, progress = 1.000000
slot update_slots: id  1 | task 2 | prompt done, n_past = 8585, n_tokens = 799
slot update_slots: id  2 | task 3 | tokenizing prompt, len = 1
slot update_slots: id  2 | task 3 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8562
slot update_slots: id  2 | task 3 | kv cache rm [0, end)
slot update_slots: id  2 | task 3 | prompt processing progress, n_past = 1249, n_tokens = 2048, progress = 0.145877
slot update_slots: id  2 | task 3 | kv cache rm [1249, end)
slot update_slots: id  2 | task 3 | prompt processing progress, n_past = 3295, n_tokens = 2048, progress = 0.384840
slot update_slots: id  2 | task 3 | kv cache rm [3295, end)
slot update_slots: id  2 | task 3 | prompt processing progress, n_past = 5341, n_tokens = 2048, progress = 0.623803
slot update_slots: id  2 | task 3 | kv cache rm [5341, end)
slot update_slots: id  2 | task 3 | prompt processing progress, n_past = 7387, n_tokens = 2048, progress = 0.862766
slot update_slots: id  2 | task 3 | kv cache rm [7387, end)
slot update_slots: id  2 | task 3 | prompt processing progress, n_past = 8562, n_tokens = 1177, progress = 1.000000
slot update_slots: id  2 | task 3 | prompt done, n_past = 8562, n_tokens = 1177
slot update_slots: id  3 | task 4 | tokenizing prompt, len = 1
slot update_slots: id  3 | task 4 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8457
slot update_slots: id  3 | task 4 | kv cache rm [0, end)
slot update_slots: id  3 | task 4 | prompt processing progress, n_past = 871, n_tokens = 2048, progress = 0.102992
slot update_slots: id  3 | task 4 | kv cache rm [871, end)
slot update_slots: id  3 | task 4 | prompt processing progress, n_past = 2916, n_tokens = 2048, progress = 0.344803
slot update_slots: id  3 | task 4 | kv cache rm [2916, end)
slot update_slots: id  3 | task 4 | prompt processing progress, n_past = 4961, n_tokens = 2048, progress = 0.586615
slot update_slots: id  3 | task 4 | kv cache rm [4961, end)
slot update_slots: id  3 | task 4 | prompt processing progress, n_past = 7006, n_tokens = 2048, progress = 0.828426
slot update_slots: id  3 | task 4 | kv cache rm [7006, end)
slot update_slots: id  3 | task 4 | prompt processing progress, n_past = 8457, n_tokens = 1454, progress = 1.000000
slot update_slots: id  3 | task 4 | prompt done, n_past = 8457, n_tokens = 1454
slot update_slots: id  4 | task 5 | tokenizing prompt, len = 1
slot update_slots: id  4 | task 5 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9007
slot update_slots: id  4 | task 5 | kv cache rm [0, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 594, n_tokens = 2048, progress = 0.065949
slot update_slots: id  4 | task 5 | kv cache rm [594, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 2638, n_tokens = 2048, progress = 0.292883
slot update_slots: id  4 | task 5 | kv cache rm [2638, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 4682, n_tokens = 2048, progress = 0.519818
slot update_slots: id  4 | task 5 | kv cache rm [4682, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 6726, n_tokens = 2048, progress = 0.746753
slot update_slots: id  4 | task 5 | kv cache rm [6726, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 8770, n_tokens = 2048, progress = 0.973687
slot update_slots: id  4 | task 5 | kv cache rm [8770, end)
slot update_slots: id  4 | task 5 | prompt processing progress, n_past = 9007, n_tokens = 241, progress = 1.000000
slot update_slots: id  4 | task 5 | prompt done, n_past = 9007, n_tokens = 241
slot update_slots: id  5 | task 6 | tokenizing prompt, len = 1
slot update_slots: id  5 | task 6 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8853
slot update_slots: id  5 | task 6 | kv cache rm [0, end)
slot update_slots: id  5 | task 6 | prompt processing progress, n_past = 1807, n_tokens = 2048, progress = 0.204112
slot update_slots: id  5 | task 6 | kv cache rm [1807, end)
slot update_slots: id  5 | task 6 | prompt processing progress, n_past = 3850, n_tokens = 2048, progress = 0.434881
slot update_slots: id  5 | task 6 | kv cache rm [3850, end)
slot update_slots: id  5 | task 6 | prompt processing progress, n_past = 5893, n_tokens = 2048, progress = 0.665650
slot update_slots: id  5 | task 6 | kv cache rm [5893, end)
slot update_slots: id  5 | task 6 | prompt processing progress, n_past = 7936, n_tokens = 2048, progress = 0.896419
slot update_slots: id  5 | task 6 | kv cache rm [7936, end)
slot update_slots: id  5 | task 6 | prompt processing progress, n_past = 8853, n_tokens = 922, progress = 1.000000
slot update_slots: id  5 | task 6 | prompt done, n_past = 8853, n_tokens = 922
slot update_slots: id  6 | task 7 | tokenizing prompt, len = 1
slot update_slots: id  6 | task 7 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9213
slot update_slots: id  6 | task 7 | kv cache rm [0, end)
slot update_slots: id  6 | task 7 | prompt processing progress, n_past = 1126, n_tokens = 2048, progress = 0.122219
slot update_slots: id  6 | task 7 | kv cache rm [1126, end)
slot update_slots: id  6 | task 7 | prompt processing progress, n_past = 3168, n_tokens = 2048, progress = 0.343862
slot update_slots: id  6 | task 7 | kv cache rm [3168, end)
slot update_slots: id  6 | task 7 | prompt processing progress, n_past = 5210, n_tokens = 2048, progress = 0.565505
slot update_slots: id  6 | task 7 | kv cache rm [5210, end)
slot update_slots: id  6 | task 7 | prompt processing progress, n_past = 7252, n_tokens = 2048, progress = 0.787149
slot update_slots: id  6 | task 7 | kv cache rm [7252, end)
slot update_slots: id  6 | task 7 | prompt processing progress, n_past = 9213, n_tokens = 1967, progress = 1.000000
slot update_slots: id  6 | task 7 | prompt done, n_past = 9213, n_tokens = 1967
slot update_slots: id  7 | task 8 | tokenizing prompt, len = 1
slot update_slots: id  7 | task 8 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9446
slot update_slots: id  7 | task 8 | kv cache rm [0, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 81, n_tokens = 2048, progress = 0.008575
slot update_slots: id  7 | task 8 | kv cache rm [81, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 2122, n_tokens = 2048, progress = 0.224645
slot update_slots: id  7 | task 8 | kv cache rm [2122, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 4163, n_tokens = 2048, progress = 0.440716
slot update_slots: id  7 | task 8 | kv cache rm [4163, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 6204, n_tokens = 2048, progress = 0.656786
slot update_slots: id  7 | task 8 | kv cache rm [6204, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 8245, n_tokens = 2048, progress = 0.872856
slot update_slots: id  7 | task 8 | kv cache rm [8245, end)
slot update_slots: id  7 | task 8 | prompt processing progress, n_past = 9446, n_tokens = 1208, progress = 1.000000
slot update_slots: id  7 | task 8 | prompt done, n_past = 9446, n_tokens = 1208
slot update_slots: id  8 | task 9 | tokenizing prompt, len = 1
slot update_slots: id  8 | task 9 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8661
slot update_slots: id  8 | task 9 | kv cache rm [0, end)
slot update_slots: id  8 | task 9 | prompt processing progress, n_past = 840, n_tokens = 2048, progress = 0.096986
slot update_slots: id  8 | task 9 | kv cache rm [840, end)
slot update_slots: id  8 | task 9 | prompt processing progress, n_past = 2880, n_tokens = 2048, progress = 0.332525
slot update_slots: id  8 | task 9 | kv cache rm [2880, end)
slot update_slots: id  8 | task 9 | prompt processing progress, n_past = 4920, n_tokens = 2048, progress = 0.568064
slot update_slots: id  8 | task 9 | kv cache rm [4920, end)
slot update_slots: id  8 | task 9 | prompt processing progress, n_past = 6960, n_tokens = 2048, progress = 0.803602
slot update_slots: id  8 | task 9 | kv cache rm [6960, end)
slot update_slots: id  8 | task 9 | prompt processing progress, n_past = 8661, n_tokens = 1709, progress = 1.000000
slot update_slots: id  8 | task 9 | prompt done, n_past = 8661, n_tokens = 1709
slot update_slots: id  9 | task 10 | tokenizing prompt, len = 1
slot update_slots: id  9 | task 10 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8390
slot update_slots: id  9 | task 10 | kv cache rm [0, end)
slot update_slots: id  9 | task 10 | prompt processing progress, n_past = 339, n_tokens = 2048, progress = 0.040405
slot update_slots: id  9 | task 10 | kv cache rm [339, end)
slot update_slots: id  9 | task 10 | prompt processing progress, n_past = 2378, n_tokens = 2048, progress = 0.283433
slot update_slots: id  9 | task 10 | kv cache rm [2378, end)
slot update_slots: id  9 | task 10 | prompt processing progress, n_past = 4417, n_tokens = 2048, progress = 0.526460
slot update_slots: id  9 | task 10 | kv cache rm [4417, end)
slot update_slots: id  9 | task 10 | prompt processing progress, n_past = 6456, n_tokens = 2048, progress = 0.769488
slot update_slots: id  9 | task 10 | kv cache rm [6456, end)
slot update_slots: id  9 | task 10 | prompt processing progress, n_past = 8390, n_tokens = 1943, progress = 1.000000
slot update_slots: id  9 | task 10 | prompt done, n_past = 8390, n_tokens = 1943
slot      release: id  8 | task 9 | stop processing: n_past = 8904, truncated = 0
slot print_timing: id  8 | task 9 |
prompt eval time =   14863.10 ms /  8661 tokens (    1.72 ms per token,   582.72 tokens per second)
       eval time =   37648.51 ms /   244 tokens (  154.30 ms per token,     6.48 tokens per second)
      total time =   52511.61 ms /  8905 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  2 | task 3 | stop processing: n_past = 9072, truncated = 0
slot print_timing: id  2 | task 3 |
prompt eval time =    5774.81 ms /  8562 tokens (    0.67 ms per token,  1482.65 tokens per second)
       eval time =  120757.56 ms /   511 tokens (  236.32 ms per token,     4.23 tokens per second)
      total time =  126532.36 ms /  9073 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  6 | task 7 | stop processing: n_past = 9708, truncated = 0
slot print_timing: id  6 | task 7 |
prompt eval time =   11687.75 ms /  9213 tokens (    1.27 ms per token,   788.26 tokens per second)
       eval time =   88451.60 ms /   496 tokens (  178.33 ms per token,     5.61 tokens per second)
      total time =  100139.35 ms /  9709 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  9 | task 10 | stop processing: n_past = 8899, truncated = 0
slot print_timing: id  9 | task 10 |
prompt eval time =   16603.01 ms /  8390 tokens (    1.98 ms per token,   505.33 tokens per second)
       eval time =   52316.64 ms /   510 tokens (  102.58 ms per token,     9.75 tokens per second)
      total time =   68919.64 ms /  8900 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  4 | task 5 | stop processing: n_past = 9582, truncated = 0
slot print_timing: id  4 | task 5 |
prompt eval time =   10398.17 ms /  9007 tokens (    1.15 ms per token,   866.21 tokens per second)
       eval time =  113429.50 ms /   576 tokens (  196.93 ms per token,     5.08 tokens per second)
      total time =  123827.67 ms /  9583 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 9206, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    3030.07 ms /  8594 tokens (    0.35 ms per token,  2836.24 tokens per second)
       eval time =  138377.86 ms /   613 tokens (  225.74 ms per token,     4.43 tokens per second)
      total time =  141407.93 ms /  9207 tokens
request: POST /completion 127.0.0.1 200
slot      release: id  3 | task 4 | stop processing: n_past = 9216, truncated = 0
slot print_timing: id  3 | task 4 |
prompt eval time =    7145.12 ms /  8457 tokens (    0.84 ms per token,  1183.61 tokens per second)
       eval time =  139217.79 ms /   760 tokens (  183.18 ms per token,     5.46 tokens per second)
      total time =  146362.91 ms /  9217 tokens
request: POST /completion 127.0.0.1 200
slot update_slots: id  7 | task 8 | slot context shift, n_keep = 0, n_left = 10239, n_discard = 5119
/build/source/ggml/src/ggml-cuda.cu:70: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /build/source/ggml/src/ggml-cuda.cu:2446
  cudaStreamSynchronize(cuda_ctx->stream())
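
For reference, the numbers in the log above are self-consistent, and the crash comes right after the first context shift logged in this run. The sketch below is not the server's code. It recomputes the per-slot context size, the q8_0 KV cache size, and the context-shift values from figures printed in the log. It assumes that n_ctx is split evenly across slots, that n_left = n_past - n_keep, and that n_discard = n_left / 2 (integer division); these assumptions match the logged values but are not confirmed against the source here. The helper names `kv_bytes` and `context_shift` are hypothetical.

```python
# Values copied from the log above.
n_ctx = 102400          # -c 102400
n_parallel = 10         # --parallel 10
n_layer = 48
n_embd_k_gqa = 1024
n_embd_v_gqa = 1024

# ggml q8_0 block: 32 int8 values plus one 2-byte fp16 scale = 34 bytes.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 34


def kv_bytes(n_embd_gqa: int) -> int:
    """Bytes for one of K or V across all layers and the full context (hypothetical helper)."""
    elems = n_layer * n_ctx * n_embd_gqa
    return elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES


def context_shift(n_past: int, n_keep: int) -> tuple[int, int]:
    """Assumed shift rule: discard half of the tokens that are not kept (hypothetical helper)."""
    n_left = n_past - n_keep
    n_discard = n_left // 2
    return n_left, n_discard


n_ctx_slot = n_ctx // n_parallel
k_mib = kv_bytes(n_embd_k_gqa) / 2**20
v_mib = kv_bytes(n_embd_v_gqa) / 2**20
# Assumption: the shift fires when the slot has filled all but one position.
n_left, n_discard = context_shift(n_past=n_ctx_slot - 1, n_keep=0)

print(n_ctx_slot, k_mib, v_mib, n_left, n_discard)
```

Under these assumptions the script reproduces n_ctx_slot = 10240, K = V = 5100.00 MiB (10200.00 MiB total), and n_left = 10239, n_discard = 5119, as logged for slot 7 just before the CUDA error.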

Labels

bug-unconfirmed, high severity, stale
