Closed
Description
Since the commit that merged #4256, inference with CLBlast segfaults at context sizes above 2k when all GPU layers are offloaded.
Command line:
C:\test\llama-b1601-bin-win-clblast-x64>main.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 4096 -b 512 -n 32 -ngl 33 -f C:\test\test.txt
main: build = 1601 (5a7d312)
main: built with MSVC 19.37.32826.1 for x64
main: seed = 1701534899
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 2060'
ggml_opencl: device FP16 support: false
Result:
Prompt processing starts, then segfaults roughly halfway through, around the 2k-token mark, before generation begins. It appears to work only if the prompt is short enough (fewer than 2k tokens).
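
To narrow down where the crash starts, prompt files of different lengths can be generated and passed to main.exe with -f one at a time. The script below is only a sketch: the helper names (make_prompt, write_prompts) and the output file names are made up for illustration, and the word count is a rough stand-in for the token count, which depends on the model's tokenizer.

```python
import random

# Filler vocabulary; any plain text would do (assumption for illustration).
WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def make_prompt(n_words, seed=0):
    """Return a filler prompt of n_words space-separated words.

    A fixed seed keeps the prompt identical between runs, so a crash
    can be reproduced with exactly the same input.
    """
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def write_prompts(word_counts, prefix="test"):
    """Write one prompt file per word count and return the file names."""
    paths = []
    for n in word_counts:
        path = f"{prefix}_{n}.txt"
        with open(path, "w", encoding="utf-8") as f:
            f.write(make_prompt(n))
        paths.append(path)
    return paths
```

Calling write_prompts([1500, 1900, 2100, 3000]) would write test_1500.txt through test_3000.txt into the current directory. Running the command line above with each file in turn should show whether the segfault really starts near 2k tokens or at some other length.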