Description
Running models in interactive, instruct, or ChatML mode, or using the server's chat interface, leads to broken generation with the Vulkan build whenever a non-zero number of layers is offloaded to the GPU. Simple text completion works properly, though.
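For context, the -cml flag wraps the system prompt and each turn in ChatML markers, roughly as follows (a sketch of the format, also visible in the transcripts below; the model's reply is generated after the final <|im_start|>assistant line):

<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant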
Expected behaviour (CLBlast build)
.\v\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Hello there! How can I assist you today?
> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?
>
llama_print_timings: load time = 5129.82 ms
llama_print_timings: sample time = 5.07 ms / 36 runs ( 0.14 ms per token, 7106.20 tokens per second)
llama_print_timings: prompt eval time = 6830.90 ms / 78 tokens ( 87.58 ms per token, 11.42 tokens per second)
llama_print_timings: eval time = 2929.09 ms / 35 runs ( 83.69 ms per token, 11.95 tokens per second)
llama_print_timings: total time = 62423.45 ms / 113 tokens
Actual behaviour (Vulkan build)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
dharmi, the user is a chatbot.
User: Hi Llama, how are you doing today?
Llama: I'm doing well, thank you for asking! Just enjoying my day and helping people with their questions. How can I assist you today?
> Can you tell me what time it is?
batting an eye at the keyboard.
>
llama_print_timings: load time = 3888.82 ms
llama_print_timings: sample time = 14.16 ms / 71 runs ( 0.20 ms per token, 5015.19 tokens per second)
llama_print_timings: prompt eval time = 6604.30 ms / 78 tokens ( 84.67 ms per token, 11.81 tokens per second)
llama_print_timings: eval time = 1645.61 ms / 70 runs ( 23.51 ms per token, 42.54 tokens per second)
llama_print_timings: total time = 45446.02 ms / 148 tokens
The server also seems to have similar issues when reusing cached prompts (for example, when the user submits a second message).
The actual output isn't consistent either; it seems to change on every run, even with a fixed seed and zero temperature, given the same user input. A quick way to check this is sketched below.
This only happens with Vulkan, and only with at least one layer offloaded to the GPU.
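As a rough check of that nondeterminism, the same fixed-seed, zero-temperature command can be run several times and its output hashed; identical hashes would be expected, but with Vulkan and offload enabled they differ. A PowerShell sketch (the piped-stdin approach for feeding the user turn is an assumption and may need adjusting if main.exe insists on a real console in interactive mode):

# Run the broken case three times with the same input and hash each transcript.
$exe  = '.\buildVulkan\bin\Release\main.exe'
$opts = '-m', '.\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf',
        '-t', '12', '-tb', '6', '-ngl', '33', '-c', '4096',
        '-p', 'This is a conversation between User and Llama, a friendly chatbot.',
        '-e', '-cml', '-s', '0', '--temp', '0'
foreach ($run in 1..3) {
    $out  = 'Hello!' | & $exe @opts 2>$null | Out-String
    $sha  = [System.Security.Cryptography.SHA256]::Create()
    $hash = [System.BitConverter]::ToString(
                $sha.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($out)))
    Write-Output "run $run -> $hash"   # hashes should match run to run, but don't
}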
More examples with other -ngl values:
CPU only (working as expected)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 0 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Hello there! How can I assist you today?
> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?
>
llama_print_timings: load time = 802.68 ms
llama_print_timings: sample time = 5.17 ms / 36 runs ( 0.14 ms per token, 6960.56 tokens per second)
llama_print_timings: prompt eval time = 3547.22 ms / 78 tokens ( 45.48 ms per token, 21.99 tokens per second)
llama_print_timings: eval time = 5921.23 ms / 35 runs ( 169.18 ms per token, 5.91 tokens per second)
llama_print_timings: total time = 20858.80 ms / 113 tokens
A single layer offloaded (already broken, but in a different way)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Fußball ist eine beliebte Sportart in Deutschland. Es wird von vielen Menschen gespielt und gefolgt.
> Can you tell me what time it is?
Uhrzeit ist eine Zeit, die von der Lokalzeit abhängt. Können Sie bitte Ihre Lokalzeit und Zeitzone angeben? Ich werde mich freuen, Ihnen die aktuelle Uhrzeit zu geben.
>
llama_print_timings: load time = 975.89 ms
llama_print_timings: sample time = 12.58 ms / 85 runs ( 0.15 ms per token, 6754.61 tokens per second)
llama_print_timings: prompt eval time = 3650.96 ms / 78 tokens ( 46.81 ms per token, 21.36 tokens per second)
llama_print_timings: eval time = 13061.39 ms / 84 runs ( 155.49 ms per token, 6.43 tokens per second)
llama_print_timings: total time = 28959.43 ms / 162 tokens
It's funny that it somewhat understood the second question but answered in German (the reply asks for my local time and time zone before offering to give the current time).
Completion only (no issue here)
CLBlast
.\buildCLBlast\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 4971.64 ms
llama_print_timings: sample time = 19.82 ms / 128 runs ( 0.15 ms per token, 6459.10 tokens per second)
llama_print_timings: prompt eval time = 2129.71 ms / 43 tokens ( 49.53 ms per token, 20.19 tokens per second)
llama_print_timings: eval time = 8192.75 ms / 127 runs ( 64.51 ms per token, 15.50 tokens per second)
llama_print_timings: total time = 10364.14 ms / 170 tokens
Log end
Vulkan
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 3933.92 ms
llama_print_timings: sample time = 27.70 ms / 128 runs ( 0.22 ms per token, 4620.94 tokens per second)
llama_print_timings: prompt eval time = 598.12 ms / 43 tokens ( 13.91 ms per token, 71.89 tokens per second)
llama_print_timings: eval time = 2923.36 ms / 127 runs ( 23.02 ms per token, 43.44 tokens per second)
llama_print_timings: total time = 3574.34 ms / 170 tokens
Log end
In case it's relevant:
vulkaninfo --summary
WARNING: [Loader Message] Code 0 : Layer VK_LAYER_RTSS uses API version 1.1 which is older than the application specified API version of 1.3. May cause issues.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.261
Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_win32_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1
Instance Layers: count = 17
---------------------------
VK_LAYER_AMD_switchable_graphics AMD switchable graphics layer 1.3.270 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_KHRONOS_profiles Khronos Profiles layer 1.3.275 version 1
VK_LAYER_KHRONOS_shader_object Khronos Shader object layer 1.3.275 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.3.275 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.3.275 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.3.275 version 2
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 1.0.2 1.3.275 version 4194306
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.3.275 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.3.275 version 1
VK_LAYER_OBS_HOOK Open Broadcaster Software hook 1.3.216 version 1
VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.2.131 version 17
VK_LAYER_ROCKSTAR_GAMES_social_club Rockstar Games Social Club Layer 1.0.70 version 1
VK_LAYER_RTSS RTSS overlay hook bootstrap 1.1.73 version 1
VK_LAYER_VALVE_steam_fossilize Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay Steam Overlay Layer 1.3.207 version 1
Devices:
========
GPU0:
apiVersion = 1.3.270
driverVersion = 2.0.294
vendorID = 0x1002
deviceID = 0x731f
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 5700 XT
driverID = DRIVER_ID_AMD_PROPRIETARY
driverName = AMD proprietary driver
driverInfo = 24.1.1 (AMD proprietary shader compiler)
conformanceVersion = 1.3.3.1
deviceUUID = 00000000-2800-0000-0000-000000000000
driverUUID = 414d442d-5749-4e2d-4452-560000000000
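As a possible next step, the failing case could be re-run with the Khronos validation layer (listed above) forced on via the standard Vulkan loader environment variable, to surface any API misuse in the backend. A sketch, assuming a PowerShell session in the repo root:

# VK_INSTANCE_LAYERS is the standard loader variable for forcing a layer on.
$env:VK_INSTANCE_LAYERS = 'VK_LAYER_KHRONOS_validation'
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0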