Description
What happened?
I am running llama-server like this:
llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10
When I make a number of /completion calls and then close those connections without waiting for the responses (e.g. by terminating the client process), llama-server often crashes with
/build/source/ggml/src/ggml-cuda.cu:70: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /build/source/ggml/src/ggml-cuda.cu:2446
cudaStreamSynchronize(cuda_ctx->stream())
I've been trying to build it with -DCMAKE_BUILD_TYPE=Debug, but for some reason I'm still seeing "variable optimized out" in gdb; I don't quite know what's going on there. Either I or Nix may be doing something fishy. The binary is definitely the debug version, since the debug info is present.
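For reference, the crash can be triggered with a small client along these lines (a hypothetical sketch, not my exact script; it assumes the server is listening on 127.0.0.1:8080 and that /completion accepts a JSON body with prompt and n_predict, which matches the server's documented API):

```python
# Hypothetical repro sketch: fire several /completion requests in parallel,
# then close each connection before the response arrives.
# Assumes llama-server is listening on 127.0.0.1:8080.
import json
import socket
import time

HOST, PORT = "127.0.0.1", 8080

def build_request(prompt: str) -> bytes:
    """Build a raw HTTP POST to /completion with a JSON body."""
    body = json.dumps({"prompt": prompt, "n_predict": 512}).encode()
    headers = (
        f"POST /completion HTTP/1.1\r\n"
        f"Host: {HOST}:{PORT}\r\n"
        f"Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Connection: close\r\n\r\n"
    ).encode()
    return headers + body

def fire_and_abort(n: int = 10, linger: float = 0.2) -> None:
    """Open n connections, send the requests, then close without reading."""
    socks = []
    for i in range(n):
        s = socket.create_connection((HOST, PORT))
        s.sendall(build_request(f"Request {i}: write a long story."))
        socks.append(s)
    time.sleep(linger)  # give the server time to start decoding
    for s in socks:
        s.close()       # drop the connection mid-generation

if __name__ == "__main__":
    fire_and_abort()
```

Running this a few times against the server started with the command line above is usually enough to hit the abort.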
GDB output:
$ coredumpctl debug
PID: 1067626 (llama-server)
UID: 1000 (sliedes)
GID: 100 (users)
Signal: 6 (ABRT)
Timestamp: Thu 2024-10-17 16:54:42 CEST (5min ago)
Command Line: llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10
Executable: /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server
Control Group: /user.slice/user-1000.slice/session-12.scope
Unit: session-12.scope
Slice: user-1000.slice
Session: 12
Owner UID: 1000 (sliedes)
Boot ID: 6224b3f52c0e45468c99f5f5cc1d17f4
Machine ID: 13629c48106c49a39ea48f0b10557f82
Hostname: poyta
Storage: /var/lib/systemd/coredump/core.llama-server.1000.6224b3f52c0e45468c99f5f5cc1d17f4.1067626.1729176882000000.zst (present)
Size on Disk: 234.9M
Message: Process 1067626 (llama-server) of user 1000 dumped core.
Module libgomp.so.1 without build-id.
Module libgcc_s.so.1 without build-id.
Module libstdc++.so.6 without build-id.
Stack trace of thread 1067626:
#0 0x00007ffff329b7dc __pthread_kill_implementation (libc.so.6 + 0x927dc)
#1 0x00007ffff3249516 raise (libc.so.6 + 0x40516)
#2 0x00007ffff3231935 abort (libc.so.6 + 0x28935)
#3 0x00007ffff381c7c5 ggml_abort.cold (libggml.so + 0x1c7c5)
#4 0x00007ffff38ea863 _Z15ggml_cuda_errorPKcS0_S0_iS0_ (libggml.so + 0xea863)
#5 0x00007ffff38eb80a _ZL29ggml_backend_cuda_synchronizeP12ggml_backend (libggml.so + 0xeb80a)
#6 0x00007ffff38759e6 ggml_backend_sched_synchronize (libggml.so + 0x759e6)
#7 0x00007ffff3877873 ggml_backend_sched_reserve (libggml.so + 0x77873)
#8 0x00007ffff7e90076 _ZL30llama_kv_cache_update_internalR13llama_context (libllama.so + 0x70076)
#9 0x00007ffff7e96c53 llama_decode (libllama.so + 0x76c53)
#10 0x000000000049fd82 _ZN14server_context12update_slotsEv (llama-server + 0xa0d82)
#11 0x0000000000487e99 _ZN12server_queue10start_loopEv (llama-server + 0x88e99)
#12 0x000000000042644e main (llama-server + 0x2744e)
#13 0x00007ffff323314e __libc_start_call_main (libc.so.6 + 0x2a14e)
#14 0x00007ffff3233209 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a209)
#15 0x0000000000428095 _start (llama-server + 0x29095)
Stack trace of thread 1067627:
#0 0x00007ffff330ad1f __poll (libc.so.6 + 0x101d1f)
#1 0x00007fffcc254e3f n/a (libcuda.so.1 + 0x254e3f)
#2 0x00007fffcc327fbf n/a (libcuda.so.1 + 0x327fbf)
#3 0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067637:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067638:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067636:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067639:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067631:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x00000000004f662b _ZZN10common_log6resumeEvENKUlvE_clEv (llama-server + 0xf762b)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067643:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067646:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067644:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067647:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067645:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067649:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067635:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
#3 0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
#4 0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
#5 0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
#6 0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
#7 0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
#8 0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
#9 0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
#10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
#11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067648:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067654:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067640:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
#3 0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
#4 0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
#5 0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
#6 0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
#7 0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
#8 0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
#9 0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
#10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
#11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067633:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067655:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067641:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x000000000048683b _ZN15server_response4recvERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEE (llama-server + 0x8783b)
#3 0x000000000049ea13 _ZN14server_context20receive_cmpl_resultsERKSt13unordered_setIiSt4hashIiESt8equal_toIiESaIiEERKSt8functionIFvRSt6vectorI18server_task_resultSaISB_EEEERKS9_IFvN8nlohmann16json_abi_v3_11_310basic_jsonINSK_11ordered_mapESA_NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNSK_14adl_serializerESA_IhSaIhEEvEEEE (llama-server + 0x9fa13)
#4 0x000000000043e50c _ZZ4mainENKUl21server_task_cmpl_typeRN8nlohmann16json_abi_v3_11_310basic_jsonINS1_11ordered_mapESt6vectorNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEblmdSaNS1_14adl_serializerES4_IhSaIhEEvEERN7httplib8ResponseEE_clES_SF_SI_ (llama-server + 0x3f50c)
#5 0x000000000043e702 _ZNSt17_Function_handlerIFvRKN7httplib7RequestERNS0_8ResponseEEZ4mainEUlS3_S5_E10_E9_M_invokeERKSt9_Any_dataS3_S5_ (llama-server + 0x3f702)
#6 0x0000000000446911 _ZNK7httplib6Server16dispatch_requestERNS_7RequestERNS_8ResponseERKSt6vectorISt4pairISt10unique_ptrINS_6detail11MatcherBaseESt14default_deleteIS9_EESt8functionIFvRKS1_S4_EEESaISI_EE.isra.0 (llama-server + 0x47911)
#7 0x00000000004b164e _ZN7httplib6Server15process_requestERNS_6StreamEbRbRKSt8functionIFvRNS_7RequestEEE (llama-server + 0xb264e)
#8 0x00000000004b1e9e _ZN7httplib6detail26process_server_socket_coreIZNS0_21process_server_socketIZNS_6Server24process_and_close_socketEiEUlRNS_6StreamEbRbE_EEbRKSt6atomicIiEimlllllT_EUlbS6_E_EEbSB_imlSC_ (llama-server + 0xb2e9e)
#9 0x00000000004b2178 _ZNSt17_Function_handlerIFvvEZN7httplib6Server15listen_internalEvEUlvE0_E9_M_invokeERKSt9_Any_data (llama-server + 0xb3178)
#10 0x00000000004529bc _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x539bc)
#11 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#12 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#13 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067651:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067660:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067642:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067634:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067632:
#0 0x00007ffff331b01f accept (libc.so.6 + 0x11201f)
#1 0x000000000042d0e4 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ4mainEUlvE1_EEEEE6_M_runEv (llama-server + 0x2e0e4)
#2 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#3 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#4 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067650:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067665:
#0 0x00007ffff330ad1f __poll (libc.so.6 + 0x101d1f)
#1 0x00007fffcc254e3f n/a (libcuda.so.1 + 0x254e3f)
#2 0x00007fffcc327fbf n/a (libcuda.so.1 + 0x327fbf)
#3 0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067657:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067664:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298f45 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8ff45)
#2 0x00007fffcc1aebca n/a (libcuda.so.1 + 0x1aebca)
#3 0x00007fffcc251113 n/a (libcuda.so.1 + 0x251113)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067658:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067661:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067652:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067653:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067656:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067662:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067659:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
Stack trace of thread 1067663:
#0 0x00007ffff32960ce __futex_abstimed_wait_common (libc.so.6 + 0x8d0ce)
#1 0x00007ffff3298c20 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8fc20)
#2 0x0000000000452a46 _ZN7httplib10ThreadPool6workerclEv (llama-server + 0x53a46)
#3 0x00007ffff34e86d3 execute_native_thread_routine (libstdc++.so.6 + 0xe86d3)
#4 0x00007ffff3299a42 start_thread (libc.so.6 + 0x90a42)
#5 0x00007ffff331905c __clone3 (libc.so.6 + 0x11005c)
ELF object binary architecture: AMD x86-64
Reading symbols from /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server...
warning: Loadable section ".dynstr" outside of ELF segments
in /nix/store/xsjknx60if36j5d8kl393yb23hhf76ic-llama-cpp-3933/bin/llama-server
Reading symbols from /nix/store/jvyl2rg6mff5c6z3477sbip03w86rwjw-llama-cpp-3933-debug/lib/debug/.build-id/db/86f367231952a378ab1268136a29fb91d5a98b.debug...
warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
[New LWP 1067626]
<... snipped text...>
[New LWP 1067663]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1".
Core was generated by `llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0)
at pthread_kill.c:44
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7ffff3787000 (LWP 1067626))]
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0)
at pthread_kill.c:44
#1 0x00007ffff329b843 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff3249516 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff3231935 in __GI_abort () at abort.c:79
#4 0x00007ffff381c7c5 in ggml_abort (file=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=70,
fmt=0x7ffff3a32d51 "CUDA error") at /build/source/ggml/src/ggml.c:305
#5 0x00007ffff38ea863 in ggml_cuda_error (stmt=stmt@entry=0x7ffff3a3f1c0 "cudaStreamSynchronize(cuda_ctx->stream())",
func=func@entry=0x7ffff3a32e36 "ggml_backend_cuda_synchronize",
file=file@entry=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=line@entry=2446,
msg=0x7ffff2e8db00 "an illegal memory access was encountered") at /build/source/ggml/src/ggml-cuda.cu:70
#6 0x00007ffff38eb80a in ggml_backend_cuda_synchronize (backend=<optimized out>)
at /build/source/ggml/src/ggml-cuda.cu:2446
#7 0x00007ffff38759e6 in ggml_backend_sched_synchronize (sched=sched@entry=0x127e630)
at /build/source/ggml/src/ggml-backend.cpp:2349
#8 0x00007ffff3877873 in ggml_backend_sched_reserve (sched=0x127e630, measure_graph=<optimized out>)
at /build/source/ggml/src/ggml-backend.cpp:2307
#9 0x00007ffff7e90076 in llama_kv_cache_update_internal (lctx=...) at /build/source/src/llama.cpp:17891
#10 0x00007ffff7e90c25 in llama_kv_cache_update (ctx=<optimized out>) at /build/source/src/llama.cpp:20123
#11 0x00007ffff7e96c53 in llama_decode_internal (batch_all=..., lctx=...) at /build/source/src/llama.cpp:17248
#12 llama_decode (ctx=0x1269150, batch=...) at /build/source/src/llama.cpp:21200
#13 0x000000000049fd82 in server_context::update_slots (this=<optimized out>)
at /build/source/examples/server/server.cpp:2292
#14 0x0000000000487e99 in std::function<void()>::operator() (this=0x7fffffffb1a8)
at /nix/store/6mmwy4jcnqnhms3i56r1hbdn656akg1d-gcc-13.3.0/include/c++/13.3.0/bits/std_function.h:591
#15 server_queue::start_loop (this=this@entry=0x7fffffffb088) at /build/source/examples/server/server.cpp:504
#16 0x000000000042644e in main (argc=<optimized out>, argv=<optimized out>)
at /build/source/examples/server/server.cpp:3402
(gdb) set substitute-path /build/source /home/sliedes/proj/llama.cpp
(gdb) fra 4
#4 0x00007ffff381c7c5 in ggml_abort (file=0x7ffff3a3e7e0 "/build/source/ggml/src/ggml-cuda.cu", line=70,
fmt=0x7ffff3a32d51 "CUDA error") at /build/source/ggml/src/ggml.c:305
305 abort();
(gdb) q
Name and Version
This is actually b3933 (f010b77) built on NixOS; the Nix build scripts seem to report version 0:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4090)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen Threadripper PRO 5955WX 16-Cores)
version: 0 (unknown)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
$ llama-server -c 102400 -ngl 100 -m Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf --chat-template chatml --check-tensors -ctk q8_0 -ctv q8_0 -fa --parallel 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4090)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen Threadripper PRO 5955WX 16-Cores)
build: 0 (unknown) with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu (debug)
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 31
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 18924 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 579 tensors from Replete-LLM-V2.5-Qwen-14b-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Replete LLM V2.5 Qwen 14b
llama_model_loader: - kv 3: general.basename str = Replete-LLM-V2.5-Qwen
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.base_model.count u32 = 1
llama_model_loader: - kv 7: general.base_model.0.name str = Qwen2.5 14B Instruct
llama_model_loader: - kv 8: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 9: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-1...
llama_model_loader: - kv 10: qwen2.block_count u32 = 48
llama_model_loader: - kv 11: qwen2.context_length u32 = 32768
llama_model_loader: - kv 12: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 13: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 14: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 15: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 16: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: general.file_type u32 = 27
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: quantize.imatrix.file str = /models_out/Replete-LLM-V2.5-Qwen-14b...
llama_model_loader: - kv 31: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 32: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 33: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 102 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq3_s: 235 tensors
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params = 14.77 B
llm_load_print_meta: model size = 6.44 GiB (3.74 BPW)
llm_load_print_meta: general.name = Replete LLM V2.5 Qwen 14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.51 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CPU buffer size = 319.04 MiB
llm_load_tensors: CUDA0 buffer size = 6271.39 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 102400
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 10200.00 MiB
llama_new_context_with_model: KV self size = 10200.00 MiB, K (q8_0): 5100.00 MiB, V (q8_0): 5100.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 6.38 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 340.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 210.01 MiB
llama_new_context_with_model: graph nodes = 1495
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 10
slot init: id 0 | task -1 | new slot n_ctx_slot = 10240
slot init: id 1 | task -1 | new slot n_ctx_slot = 10240
slot init: id 2 | task -1 | new slot n_ctx_slot = 10240
slot init: id 3 | task -1 | new slot n_ctx_slot = 10240
slot init: id 4 | task -1 | new slot n_ctx_slot = 10240
slot init: id 5 | task -1 | new slot n_ctx_slot = 10240
slot init: id 6 | task -1 | new slot n_ctx_slot = 10240
slot init: id 7 | task -1 | new slot n_ctx_slot = 10240
slot init: id 8 | task -1 | new slot n_ctx_slot = 10240
slot init: id 9 | task -1 | new slot n_ctx_slot = 10240
main: model loaded
main: chat template, built_in: 0, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
request: GET /props 127.0.0.1 200
request: POST /tokenize 127.0.0.1 200
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | tokenizing prompt, len = 1
slot update_slots: id 0 | task 0 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8594
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.238306
slot launch_slot_: id 1 | task 2 | processing task
slot launch_slot_: id 2 | task 3 | processing task
slot launch_slot_: id 3 | task 4 | processing task
slot launch_slot_: id 4 | task 5 | processing task
slot launch_slot_: id 5 | task 6 | processing task
slot launch_slot_: id 6 | task 7 | processing task
slot launch_slot_: id 7 | task 8 | processing task
slot launch_slot_: id 8 | task 9 | processing task
slot launch_slot_: id 9 | task 10 | processing task
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.476612
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.714917
slot update_slots: id 0 | task 0 | kv cache rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.953223
slot update_slots: id 0 | task 0 | kv cache rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8594, n_tokens = 402, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 8594, n_tokens = 402
slot update_slots: id 1 | task 2 | tokenizing prompt, len = 1
slot update_slots: id 1 | task 2 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8585
slot update_slots: id 1 | task 2 | kv cache rm [0, end)
slot update_slots: id 1 | task 2 | prompt processing progress, n_past = 1646, n_tokens = 2048, progress = 0.191730
slot update_slots: id 1 | task 2 | kv cache rm [1646, end)
slot update_slots: id 1 | task 2 | prompt processing progress, n_past = 3693, n_tokens = 2048, progress = 0.430169
slot update_slots: id 1 | task 2 | kv cache rm [3693, end)
slot update_slots: id 1 | task 2 | prompt processing progress, n_past = 5740, n_tokens = 2048, progress = 0.668608
slot update_slots: id 1 | task 2 | kv cache rm [5740, end)
slot update_slots: id 1 | task 2 | prompt processing progress, n_past = 7787, n_tokens = 2048, progress = 0.907047
slot update_slots: id 1 | task 2 | kv cache rm [7787, end)
slot update_slots: id 1 | task 2 | prompt processing progress, n_past = 8585, n_tokens = 799, progress = 1.000000
slot update_slots: id 1 | task 2 | prompt done, n_past = 8585, n_tokens = 799
slot update_slots: id 2 | task 3 | tokenizing prompt, len = 1
slot update_slots: id 2 | task 3 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8562
slot update_slots: id 2 | task 3 | kv cache rm [0, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_past = 1249, n_tokens = 2048, progress = 0.145877
slot update_slots: id 2 | task 3 | kv cache rm [1249, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_past = 3295, n_tokens = 2048, progress = 0.384840
slot update_slots: id 2 | task 3 | kv cache rm [3295, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_past = 5341, n_tokens = 2048, progress = 0.623803
slot update_slots: id 2 | task 3 | kv cache rm [5341, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_past = 7387, n_tokens = 2048, progress = 0.862766
slot update_slots: id 2 | task 3 | kv cache rm [7387, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_past = 8562, n_tokens = 1177, progress = 1.000000
slot update_slots: id 2 | task 3 | prompt done, n_past = 8562, n_tokens = 1177
slot update_slots: id 3 | task 4 | tokenizing prompt, len = 1
slot update_slots: id 3 | task 4 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8457
slot update_slots: id 3 | task 4 | kv cache rm [0, end)
slot update_slots: id 3 | task 4 | prompt processing progress, n_past = 871, n_tokens = 2048, progress = 0.102992
slot update_slots: id 3 | task 4 | kv cache rm [871, end)
slot update_slots: id 3 | task 4 | prompt processing progress, n_past = 2916, n_tokens = 2048, progress = 0.344803
slot update_slots: id 3 | task 4 | kv cache rm [2916, end)
slot update_slots: id 3 | task 4 | prompt processing progress, n_past = 4961, n_tokens = 2048, progress = 0.586615
slot update_slots: id 3 | task 4 | kv cache rm [4961, end)
slot update_slots: id 3 | task 4 | prompt processing progress, n_past = 7006, n_tokens = 2048, progress = 0.828426
slot update_slots: id 3 | task 4 | kv cache rm [7006, end)
slot update_slots: id 3 | task 4 | prompt processing progress, n_past = 8457, n_tokens = 1454, progress = 1.000000
slot update_slots: id 3 | task 4 | prompt done, n_past = 8457, n_tokens = 1454
slot update_slots: id 4 | task 5 | tokenizing prompt, len = 1
slot update_slots: id 4 | task 5 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9007
slot update_slots: id 4 | task 5 | kv cache rm [0, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 594, n_tokens = 2048, progress = 0.065949
slot update_slots: id 4 | task 5 | kv cache rm [594, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 2638, n_tokens = 2048, progress = 0.292883
slot update_slots: id 4 | task 5 | kv cache rm [2638, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 4682, n_tokens = 2048, progress = 0.519818
slot update_slots: id 4 | task 5 | kv cache rm [4682, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 6726, n_tokens = 2048, progress = 0.746753
slot update_slots: id 4 | task 5 | kv cache rm [6726, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 8770, n_tokens = 2048, progress = 0.973687
slot update_slots: id 4 | task 5 | kv cache rm [8770, end)
slot update_slots: id 4 | task 5 | prompt processing progress, n_past = 9007, n_tokens = 241, progress = 1.000000
slot update_slots: id 4 | task 5 | prompt done, n_past = 9007, n_tokens = 241
slot update_slots: id 5 | task 6 | tokenizing prompt, len = 1
slot update_slots: id 5 | task 6 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8853
slot update_slots: id 5 | task 6 | kv cache rm [0, end)
slot update_slots: id 5 | task 6 | prompt processing progress, n_past = 1807, n_tokens = 2048, progress = 0.204112
slot update_slots: id 5 | task 6 | kv cache rm [1807, end)
slot update_slots: id 5 | task 6 | prompt processing progress, n_past = 3850, n_tokens = 2048, progress = 0.434881
slot update_slots: id 5 | task 6 | kv cache rm [3850, end)
slot update_slots: id 5 | task 6 | prompt processing progress, n_past = 5893, n_tokens = 2048, progress = 0.665650
slot update_slots: id 5 | task 6 | kv cache rm [5893, end)
slot update_slots: id 5 | task 6 | prompt processing progress, n_past = 7936, n_tokens = 2048, progress = 0.896419
slot update_slots: id 5 | task 6 | kv cache rm [7936, end)
slot update_slots: id 5 | task 6 | prompt processing progress, n_past = 8853, n_tokens = 922, progress = 1.000000
slot update_slots: id 5 | task 6 | prompt done, n_past = 8853, n_tokens = 922
slot update_slots: id 6 | task 7 | tokenizing prompt, len = 1
slot update_slots: id 6 | task 7 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9213
slot update_slots: id 6 | task 7 | kv cache rm [0, end)
slot update_slots: id 6 | task 7 | prompt processing progress, n_past = 1126, n_tokens = 2048, progress = 0.122219
slot update_slots: id 6 | task 7 | kv cache rm [1126, end)
slot update_slots: id 6 | task 7 | prompt processing progress, n_past = 3168, n_tokens = 2048, progress = 0.343862
slot update_slots: id 6 | task 7 | kv cache rm [3168, end)
slot update_slots: id 6 | task 7 | prompt processing progress, n_past = 5210, n_tokens = 2048, progress = 0.565505
slot update_slots: id 6 | task 7 | kv cache rm [5210, end)
slot update_slots: id 6 | task 7 | prompt processing progress, n_past = 7252, n_tokens = 2048, progress = 0.787149
slot update_slots: id 6 | task 7 | kv cache rm [7252, end)
slot update_slots: id 6 | task 7 | prompt processing progress, n_past = 9213, n_tokens = 1967, progress = 1.000000
slot update_slots: id 6 | task 7 | prompt done, n_past = 9213, n_tokens = 1967
slot update_slots: id 7 | task 8 | tokenizing prompt, len = 1
slot update_slots: id 7 | task 8 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9446
slot update_slots: id 7 | task 8 | kv cache rm [0, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 81, n_tokens = 2048, progress = 0.008575
slot update_slots: id 7 | task 8 | kv cache rm [81, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 2122, n_tokens = 2048, progress = 0.224645
slot update_slots: id 7 | task 8 | kv cache rm [2122, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 4163, n_tokens = 2048, progress = 0.440716
slot update_slots: id 7 | task 8 | kv cache rm [4163, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 6204, n_tokens = 2048, progress = 0.656786
slot update_slots: id 7 | task 8 | kv cache rm [6204, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 8245, n_tokens = 2048, progress = 0.872856
slot update_slots: id 7 | task 8 | kv cache rm [8245, end)
slot update_slots: id 7 | task 8 | prompt processing progress, n_past = 9446, n_tokens = 1208, progress = 1.000000
slot update_slots: id 7 | task 8 | prompt done, n_past = 9446, n_tokens = 1208
slot update_slots: id 8 | task 9 | tokenizing prompt, len = 1
slot update_slots: id 8 | task 9 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8661
slot update_slots: id 8 | task 9 | kv cache rm [0, end)
slot update_slots: id 8 | task 9 | prompt processing progress, n_past = 840, n_tokens = 2048, progress = 0.096986
slot update_slots: id 8 | task 9 | kv cache rm [840, end)
slot update_slots: id 8 | task 9 | prompt processing progress, n_past = 2880, n_tokens = 2048, progress = 0.332525
slot update_slots: id 8 | task 9 | kv cache rm [2880, end)
slot update_slots: id 8 | task 9 | prompt processing progress, n_past = 4920, n_tokens = 2048, progress = 0.568064
slot update_slots: id 8 | task 9 | kv cache rm [4920, end)
slot update_slots: id 8 | task 9 | prompt processing progress, n_past = 6960, n_tokens = 2048, progress = 0.803602
slot update_slots: id 8 | task 9 | kv cache rm [6960, end)
slot update_slots: id 8 | task 9 | prompt processing progress, n_past = 8661, n_tokens = 1709, progress = 1.000000
slot update_slots: id 8 | task 9 | prompt done, n_past = 8661, n_tokens = 1709
slot update_slots: id 9 | task 10 | tokenizing prompt, len = 1
slot update_slots: id 9 | task 10 | prompt tokenized, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 8390
slot update_slots: id 9 | task 10 | kv cache rm [0, end)
slot update_slots: id 9 | task 10 | prompt processing progress, n_past = 339, n_tokens = 2048, progress = 0.040405
slot update_slots: id 9 | task 10 | kv cache rm [339, end)
slot update_slots: id 9 | task 10 | prompt processing progress, n_past = 2378, n_tokens = 2048, progress = 0.283433
slot update_slots: id 9 | task 10 | kv cache rm [2378, end)
slot update_slots: id 9 | task 10 | prompt processing progress, n_past = 4417, n_tokens = 2048, progress = 0.526460
slot update_slots: id 9 | task 10 | kv cache rm [4417, end)
slot update_slots: id 9 | task 10 | prompt processing progress, n_past = 6456, n_tokens = 2048, progress = 0.769488
slot update_slots: id 9 | task 10 | kv cache rm [6456, end)
slot update_slots: id 9 | task 10 | prompt processing progress, n_past = 8390, n_tokens = 1943, progress = 1.000000
slot update_slots: id 9 | task 10 | prompt done, n_past = 8390, n_tokens = 1943
slot release: id 8 | task 9 | stop processing: n_past = 8904, truncated = 0
slot print_timing: id 8 | task 9 |
prompt eval time = 14863.10 ms / 8661 tokens ( 1.72 ms per token, 582.72 tokens per second)
eval time = 37648.51 ms / 244 tokens ( 154.30 ms per token, 6.48 tokens per second)
total time = 52511.61 ms / 8905 tokens
request: POST /completion 127.0.0.1 200
slot release: id 2 | task 3 | stop processing: n_past = 9072, truncated = 0
slot print_timing: id 2 | task 3 |
prompt eval time = 5774.81 ms / 8562 tokens ( 0.67 ms per token, 1482.65 tokens per second)
eval time = 120757.56 ms / 511 tokens ( 236.32 ms per token, 4.23 tokens per second)
total time = 126532.36 ms / 9073 tokens
request: POST /completion 127.0.0.1 200
slot release: id 6 | task 7 | stop processing: n_past = 9708, truncated = 0
slot print_timing: id 6 | task 7 |
prompt eval time = 11687.75 ms / 9213 tokens ( 1.27 ms per token, 788.26 tokens per second)
eval time = 88451.60 ms / 496 tokens ( 178.33 ms per token, 5.61 tokens per second)
total time = 100139.35 ms / 9709 tokens
request: POST /completion 127.0.0.1 200
slot release: id 9 | task 10 | stop processing: n_past = 8899, truncated = 0
slot print_timing: id 9 | task 10 |
prompt eval time = 16603.01 ms / 8390 tokens ( 1.98 ms per token, 505.33 tokens per second)
eval time = 52316.64 ms / 510 tokens ( 102.58 ms per token, 9.75 tokens per second)
total time = 68919.64 ms / 8900 tokens
request: POST /completion 127.0.0.1 200
slot release: id 4 | task 5 | stop processing: n_past = 9582, truncated = 0
slot print_timing: id 4 | task 5 |
prompt eval time = 10398.17 ms / 9007 tokens ( 1.15 ms per token, 866.21 tokens per second)
eval time = 113429.50 ms / 576 tokens ( 196.93 ms per token, 5.08 tokens per second)
total time = 123827.67 ms / 9583 tokens
request: POST /completion 127.0.0.1 200
slot release: id 0 | task 0 | stop processing: n_past = 9206, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 3030.07 ms / 8594 tokens ( 0.35 ms per token, 2836.24 tokens per second)
eval time = 138377.86 ms / 613 tokens ( 225.74 ms per token, 4.43 tokens per second)
total time = 141407.93 ms / 9207 tokens
request: POST /completion 127.0.0.1 200
slot release: id 3 | task 4 | stop processing: n_past = 9216, truncated = 0
slot print_timing: id 3 | task 4 |
prompt eval time = 7145.12 ms / 8457 tokens ( 0.84 ms per token, 1183.61 tokens per second)
eval time = 139217.79 ms / 760 tokens ( 183.18 ms per token, 5.46 tokens per second)
total time = 146362.91 ms / 9217 tokens
request: POST /completion 127.0.0.1 200
slot update_slots: id 7 | task 8 | slot context shift, n_keep = 0, n_left = 10239, n_discard = 5119
/build/source/ggml/src/ggml-cuda.cu:70: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /build/source/ggml/src/ggml-cuda.cu:2446
cudaStreamSynchronize(cuda_ctx->stream())
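For reference, this is roughly how I trigger the crash — a minimal stdlib-only sketch, not the exact client I use: it fires several concurrent `/completion` requests and drops each connection without reading the response, mimicking a client process being terminated mid-generation. The host/port and the request body fields (`prompt`, `n_predict`, `stream`) are assumptions matching my local setup.

```python
# Hypothetical reproduction sketch: abandon /completion requests mid-generation.
import http.client
import json
import threading

SERVER = ("127.0.0.1", 8080)  # where llama-server listens (assumption)

def fire_and_abandon(prompt: str, n_predict: int = 512) -> None:
    """POST to /completion, then close the socket without reading the reply."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict, "stream": True})
    conn = http.client.HTTPConnection(*SERVER, timeout=5)
    try:
        conn.request("POST", "/completion", body,
                     {"Content-Type": "application/json"})
        # Intentionally never call conn.getresponse(): closing below
        # abandons the request while the slot is still generating.
    except OSError:
        pass  # connection errors don't matter for the repro
    finally:
        conn.close()

def hammer(n_clients: int = 10, prompt: str = "word " * 8000) -> None:
    """Launch n_clients concurrent requests, all abandoned immediately."""
    threads = [threading.Thread(target=fire_and_abandon, args=(prompt,))
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    hammer()
```

With `--parallel 10` and long prompts the crash usually shows up within a few runs, though not deterministically.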