What happened?
Running a model and specifying an 8192-token context like so:
/llama-server --model Mistral-Large-Instruct-2407-IQ3_XXS.gguf -c 8192 -ngl 35
Causes the following to print during initialization:
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
This freaked me out, because based on this discussion, the message implies that I'm actually only getting a 4096-token context due to parallelization. On the other hand, I also see:
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
slot reset: id 0 | task -1 |
which is what I would expect.
This discrepancy seems to be due to the fact that the llama.cpp server temporarily increments n_parallel when loading the model (for a reason related to Mamba? I'm not sure why this is done).
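For illustration, here is a minimal standalone sketch of how I understand the two numbers could arise. It assumes the warning is based on the total context divided by the number of sequences, and that the server briefly uses n_parallel + 1 sequences while loading; both assumptions are mine, and this is plain C++ for illustration, not actual llama.cpp code:

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    const uint32_t n_ctx       = 8192;   // what I passed via -c
    const uint32_t n_ctx_train = 131072; // the model's training context
    const uint32_t n_parallel  = 1;      // server default, one slot

    // Assumption: during model load the server temporarily uses one extra
    // sequence, so the per-sequence context is computed with 2 sequences.
    const uint32_t n_seq_max_during_load = n_parallel + 1;
    const uint32_t n_ctx_per_seq = n_ctx / n_seq_max_during_load; // 8192 / 2 = 4096

    if (n_ctx_per_seq < n_ctx_train) {
        printf("n_ctx_per_seq (%u) < n_ctx_train (%u) -- "
               "the full capacity of the model will not be utilized\n",
               n_ctx_per_seq, n_ctx_train);
    }

    // The slot that actually serves requests is divided by the real
    // n_parallel, so it still gets the full requested context.
    const uint32_t n_ctx_slot = n_ctx / n_parallel; // 8192
    printf("n_ctx_slot = %u\n", n_ctx_slot);

    return 0;
}
```

If that reading is right, the 4096 in the warning is only an artifact of the temporary extra sequence, and the slot really does get 8192.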
My concerns are:
- What context size is actually being used here: 8192 or 4096?
- Should this be considered a bug, since the messages essentially contradict each other?
Please let me know if any other information is needed, but this should be easy to replicate. Thanks!
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4033 (a9e8a9a0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response