Description
Running models in interactive, instruct, or ChatML mode, or using the server's chat interface, leads to broken generation with the Vulkan build whenever a non-zero number of layers is offloaded to the GPU. Simple text completion works properly, though.
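For context, the -cml flag wraps the system prompt and each turn in ChatML markers, roughly as follows (a sketch of the format, also visible in the transcripts below; the model's reply is generated after the final <|im_start|>assistant line):

<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant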
Expected behaviour (CLBlast build)
.\v\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Hello there! How can I assist you today?
> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?
>
llama_print_timings: load time = 5129.82 ms
llama_print_timings: sample time = 5.07 ms / 36 runs ( 0.14 ms per token, 7106.20 tokens per second)
llama_print_timings: prompt eval time = 6830.90 ms / 78 tokens ( 87.58 ms per token, 11.42 tokens per second)
llama_print_timings: eval time = 2929.09 ms / 35 runs ( 83.69 ms per token, 11.95 tokens per second)
llama_print_timings: total time = 62423.45 ms / 113 tokens
Actual behaviour (Vulkan build)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
dharmi, the user is a chatbot.
User: Hi Llama, how are you doing today?
Llama: I'm doing well, thank you for asking! Just enjoying my day and helping people with their questions. How can I assist you today?
> Can you tell me what time it is?
batting an eye at the keyboard.
>
llama_print_timings: load time = 3888.82 ms
llama_print_timings: sample time = 14.16 ms / 71 runs ( 0.20 ms per token, 5015.19 tokens per second)
llama_print_timings: prompt eval time = 6604.30 ms / 78 tokens ( 84.67 ms per token, 11.81 tokens per second)
llama_print_timings: eval time = 1645.61 ms / 70 runs ( 23.51 ms per token, 42.54 tokens per second)
llama_print_timings: total time = 45446.02 ms / 148 tokens
The server also seems to have similar issues when reusing cached prompts (for example, when the user submits a second message).
The actual output isn't consistent either; it seems to change on every run, even with a fixed seed and zero temperature, given the same user input. A quick way to check this is sketched below.
This only happens with Vulkan, and only with at least one layer offloaded to the GPU.
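As a rough check of that nondeterminism, the same fixed-seed, zero-temperature command can be run several times and its output hashed; identical hashes would be expected, but with Vulkan and offload enabled they differ. A PowerShell sketch (the piped-stdin approach for feeding the user turn is an assumption and may need adjusting if main.exe insists on a real console in interactive mode):

# Run the broken case three times with the same input and hash each transcript.
$exe  = '.\buildVulkan\bin\Release\main.exe'
$opts = '-m', '.\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf',
        '-t', '12', '-tb', '6', '-ngl', '33', '-c', '4096',
        '-p', 'This is a conversation between User and Llama, a friendly chatbot.',
        '-e', '-cml', '-s', '0', '--temp', '0'
foreach ($run in 1..3) {
    $out  = 'Hello!' | & $exe @opts 2>$null | Out-String
    $sha  = [System.Security.Cryptography.SHA256]::Create()
    $hash = [System.BitConverter]::ToString(
                $sha.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($out)))
    Write-Output "run $run -> $hash"   # hashes should match run to run, but don't
}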
More examples with other -ngl values:
CPU only (working as expected)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 0 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Hello there! How can I assist you today?
> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?
>
llama_print_timings: load time = 802.68 ms
llama_print_timings: sample time = 5.17 ms / 36 runs ( 0.14 ms per token, 6960.56 tokens per second)
llama_print_timings: prompt eval time = 3547.22 ms / 78 tokens ( 45.48 ms per token, 21.99 tokens per second)
llama_print_timings: eval time = 5921.23 ms / 35 runs ( 169.18 ms per token, 5.91 tokens per second)
llama_print_timings: total time = 20858.80 ms / 113 tokens
A single layer offloaded (already broken, but in a different way)
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user
> Hello!
Fußball ist eine beliebte Sportart in Deutschland. Es wird von vielen Menschen gespielt und gefolgt.
> Can you tell me what time it is?
Uhrzeit ist eine Zeit, die von der Lokalzeit abhängt. Können Sie bitte Ihre Lokalzeit und Zeitzone angeben? Ich werde mich freuen, Ihnen die aktuelle Uhrzeit zu geben.
>
llama_print_timings: load time = 975.89 ms
llama_print_timings: sample time = 12.58 ms / 85 runs ( 0.15 ms per token, 6754.61 tokens per second)
llama_print_timings: prompt eval time = 3650.96 ms / 78 tokens ( 46.81 ms per token, 21.36 tokens per second)
llama_print_timings: eval time = 13061.39 ms / 84 runs ( 155.49 ms per token, 6.43 tokens per second)
llama_print_timings: total time = 28959.43 ms / 162 tokens
It's funny that it somewhat understood the second question but answered in German (the reply asks for my local time and time zone before offering to give the current time).
Completion only (no issue here)
CLBlast
.\buildCLBlast\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 4971.64 ms
llama_print_timings: sample time = 19.82 ms / 128 runs ( 0.15 ms per token, 6459.10 tokens per second)
llama_print_timings: prompt eval time = 2129.71 ms / 43 tokens ( 49.53 ms per token, 20.19 tokens per second)
llama_print_timings: eval time = 8192.75 ms / 127 runs ( 64.51 ms per token, 15.50 tokens per second)
llama_print_timings: total time = 10364.14 ms / 170 tokens
Log end
Vulkan
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 3933.92 ms
llama_print_timings: sample time = 27.70 ms / 128 runs ( 0.22 ms per token, 4620.94 tokens per second)
llama_print_timings: prompt eval time = 598.12 ms / 43 tokens ( 13.91 ms per token, 71.89 tokens per second)
llama_print_timings: eval time = 2923.36 ms / 127 runs ( 23.02 ms per token, 43.44 tokens per second)
llama_print_timings: total time = 3574.34 ms / 170 tokens
Log end
In case it's relevant:
vulkaninfo --summary
WARNING: [Loader Message] Code 0 : Layer VK_LAYER_RTSS uses API version 1.1 which is older than the application specified API version of 1.3. May cause issues.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.261
Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_win32_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1
Instance Layers: count = 17
---------------------------
VK_LAYER_AMD_switchable_graphics AMD switchable graphics layer 1.3.270 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_EOS_Overlay Vulkan overlay layer for Epic Online Services 1.2.136 version 1
VK_LAYER_KHRONOS_profiles Khronos Profiles layer 1.3.275 version 1
VK_LAYER_KHRONOS_shader_object Khronos Shader object layer 1.3.275 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.3.275 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.3.275 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.3.275 version 2
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 1.0.2 1.3.275 version 4194306
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.3.275 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.3.275 version 1
VK_LAYER_OBS_HOOK Open Broadcaster Software hook 1.3.216 version 1
VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.2.131 version 17
VK_LAYER_ROCKSTAR_GAMES_social_club Rockstar Games Social Club Layer 1.0.70 version 1
VK_LAYER_RTSS RTSS overlay hook bootstrap 1.1.73 version 1
VK_LAYER_VALVE_steam_fossilize Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay Steam Overlay Layer 1.3.207 version 1
Devices:
========
GPU0:
apiVersion = 1.3.270
driverVersion = 2.0.294
vendorID = 0x1002
deviceID = 0x731f
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 5700 XT
driverID = DRIVER_ID_AMD_PROPRIETARY
driverName = AMD proprietary driver
driverInfo = 24.1.1 (AMD proprietary shader compiler)
conformanceVersion = 1.3.3.1
deviceUUID = 00000000-2800-0000-0000-000000000000
driverUUID = 414d442d-5749-4e2d-4452-560000000000
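As a possible next step, the failing case could be re-run with the Khronos validation layer (listed above) forced on via the standard Vulkan loader environment variable, to surface any API misuse in the backend. A sketch, assuming a PowerShell session in the repo root:

# VK_INSTANCE_LAYERS is the standard loader variable for forcing a layer on.
$env:VK_INSTANCE_LAYERS = 'VK_LAYER_KHRONOS_validation'
.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0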