
Unable to get a response in interactive mode #1423

Closed · @re11ding

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I expect the model to respond to my input, allowing a two-way conversation.

Current Behavior

Once it's my turn to provide a prompt and I press Enter, CPU usage reaches around 30% and then a response is never generated, no matter how long it's left to run. I'm always forced to send SIGINT with Ctrl+C in order to terminate llama.cpp.

I've also tried it with the 7B model, but the result is sadly still the same.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

Intel(R) Core(TM) i7-6820HK CPU @ 2.70GHz with 32GB of RAM at 2400MHz

  • Operating System, e.g. for Linux:

Windows 10 v1909

  • SDK version, e.g. for Linux:
Python 3.10.4
GNU Make 4.4
G++ (GCC) 13.1.0

Steps to Reproduce

Run with the usual parameters (a minimal command is sketched below; the exact invocation is in the Failure Logs), attempt to respond, and simply wait. --keep is not necessary; it was merely left over from my last test run to see whether it changed anything.
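For reference, a command along these lines should reproduce the behaviour; the model path and prompt here are just examples (I used 30B in the log below, and also tried 7B):

main -m models/7B/ggml-model-q4_0.bin -c 2048 -i -r "User:" --color --prompt "Hello! How are you?"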

Failure Logs

E:\LLaMA\llama.cpp>main -m models/30B/ggml-model-q4_0.bin -n -1 -c 2048 -i -r "User:" --color --keep -1 --prompt "Hello!How are you? Please answer in less than 5 words."
main: build = 0 (unknown)
main: seed  = 1683935758
llama.cpp: loading model from models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 127.27 KB
llama_model_load_internal: mem required  = 21695.48 MB (+ 3124.00 MB per state)
llama_init_from_file: kv self size  = 3120.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 16


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Hello!How are you? Please answer in less than 5 words.
I'm ok,how are you? Answer please in less than five words.
Good question. Here is my answer: 'How am I ? '
i'm doing great, what about u?
Not good at all! How are you?
Hi, whats up? how are you?
I am fine thanks! And you?
User:I'm doing great, thank you!

llama_print_timings:        load time = 19669.43 ms
llama_print_timings:      sample time =    57.24 ms /    75 runs   (    0.76 ms per run)
llama_print_timings: prompt eval time = 17329.88 ms /    16 tokens ( 1083.12 ms per token)
llama_print_timings:        eval time = 86146.71 ms /    75 runs   ( 1148.62 ms per run)
llama_print_timings:       total time = 850995.00 ms
