llama : initial Mamba-2 support #9126
base: master
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
Force-pushed from e9b0d19 to aff9692
Hey @compilade, thanks for implementing this! I tried converting https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 using …
Nevertheless, I successfully converted a Mamba-Codestral model and ran it (remember to select the correct chat template, since the model does not come with one):
The result looks promising, but I have no idea why there are …
Link to download GGUF: https://huggingface.co/ngxson/codestral-mamba-llamacpp-test/tree/main
The steps I took to convert Mamba-Codestral-7B-v0.1 are the following:
I did not have tokenization problems in my tests, maybe because I was using the original SentencePiece tokenizer instead of a BPE tokenizer. There are probably still problems with the SentencePiece tokenizer too, but I think it should be preferred for this model; it should be easier to handle without workarounds. I should change that in …
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
Thanks for the guide! I've successfully converted the original repository to GGUF by following your steps. I'm wondering if … (Also cc @Vaibhavs10 since he's the maintainer of gguf-my-repo.)
Hey @compilade / @ngxson - just FYI, the transformers weights are now merged in the main repo: https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1
If you face any issues with the conversion, could you open an issue on the repo for us to track? 🤗
Any updates on when Codestral Mamba should be supported?
Nice work! Just a note on the ssm_scan kernel performance: a better fused implementation by the flash-linear-attention project provides equivalent functionality to Mamba-2's original kernel (fla-org/flash-linear-attention#49) and runs 2x faster (fla-org/flash-linear-attention#50).
Hi @compilade! I worked on repo conversion for the transformers-compatible mamba2 version, let us know if you need anything from us to move forward with this PR :)
It sounds like having a simple fallback of expected filenames would be a reasonable thing to include here? I don't know that we want to maintain a ton of different ones, but adding a second layer of fallbacks for alternate filenames doesn't feel arduous.
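For illustration, such a fallback could look roughly like the hypothetical helper below (this is not part of convert_hf_to_gguf.py, and the candidate filenames are only examples):

from pathlib import Path

def find_weight_file(model_dir: Path) -> Path:
    # Hypothetical fallback order; purely illustrative.
    candidates = [
        "model.safetensors",         # the usual single-file name
        "consolidated.safetensors",  # name used by some Mistral releases
        "pytorch_model.bin",
    ]
    for name in candidates:
        path = model_dir / name
        if path.exists():
            return path
    raise FileNotFoundError(f"no known weight file in {model_dir}")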
That's not really a problem anymore (at least for Mamba-Codestral) since the official repo was updated in https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/commit/88085f9cdfa832c3aca8a0315a4520cf7558c947 to use more standard names. What is currently blocking this is that the Metal and CUDA kernels for ssm_scan have not yet been adapted to the Mamba-2 changes.
Any updates on this?
I took a whack at syncing this with the latest merges to master. The broken symptom is that when I run a model, it generates tokens but the output is garbage. I also rebased Granite 4 on top of this broken point and the model did generate tokens, so the …
Works, but using lambda functions might not be that clean.
@gabe-l-hart I did not have the same problem when merging …
Awesome, thanks for finding that. I must have fat-fingered a merge resolution. I'll give it another go and see if I can get the Granite 4 branch working again.
I've now got a temporary branch to sync this with the Hybrid Recurrent Cache (#13979): https://github.com/gabe-l-hart/llama.cpp/tree/mamba2-sync This seems to work correctly for non-hybrid models. (cc @younesbelkada since I think you were trying to get to the same merge point of hybrid + mamba2)
@compilade I did also rebase the Granite 4 branch once I verified that my sync-point branch is working for non-hybrid models. Something isn't working correctly and is showing the same symptom I saw earlier where the model runs, but produces garbage. The last point in history where this wasn't happening was e94f39, so I'm going to look into the changes since that point. My broken Granite 4 branch is here in case you get a moment to glance at it: https://github.com/gabe-l-hart/llama.cpp/tree/broken/GraniteFour
This turned out to be a bug in the hybrid recurrent cache, so it's resolved now. @compilade with your latest changes to avoid copies, I now see … Also, I'm seeing some interesting performance characteristics between Metal and CPU:
# simple run metal
./bin/llama-cli -m ~/models/mamba2-370m-hf/ggml-model-Q4_K_M.gguf --temp 0 -p "Tell me a story about a developer and their dog" -no-cnv -n 100
# llama_perf_sampler_print: sampling time = 2.41 ms / 110 runs ( 0.02 ms per token, 45681.06 tokens per second)
# llama_perf_context_print: load time = 141.28 ms
# llama_perf_context_print: prompt eval time = 61.99 ms / 10 tokens ( 6.20 ms per token, 161.31 tokens per second)
# llama_perf_context_print: eval time = 1748.05 ms / 99 runs ( 17.66 ms per token, 56.63 tokens per second)
# llama_perf_context_print: total time = 1815.04 ms / 109 tokens
# simple run CPU
./bin/llama-cli -m ~/models/mamba2-370m-hf/ggml-model-Q4_K_M.gguf --temp 0 -p "Tell me a story about a developer and their dog" -no-cnv -n 100 -ngl 0
# llama_perf_sampler_print: sampling time = 2.84 ms / 110 runs ( 0.03 ms per token, 38773.35 tokens per second)
# llama_perf_context_print: load time = 99.32 ms
# llama_perf_context_print: prompt eval time = 33.33 ms / 10 tokens ( 3.33 ms per token, 300.01 tokens per second)
# llama_perf_context_print: eval time = 1009.74 ms / 99 runs ( 10.20 ms per token, 98.04 tokens per second)
# llama_perf_context_print: total time = 1049.48 ms / 109 tokens
# batched-bench run metal
./bin/llama-batched-bench -m ~/models/mamba2-370m-hf/ggml-model-Q4_K_M.gguf -c 2048 -b 2048 -ub 512 -npp 10 -ntg 128 -npl 1
# llama_perf_context_print: load time = 167.10 ms
# llama_perf_context_print: prompt eval time = 125.31 ms / 26 tokens ( 4.82 ms per token, 207.48 tokens per second)
# llama_perf_context_print: eval time = 2233.30 ms / 128 runs ( 17.45 ms per token, 57.31 tokens per second)
# llama_perf_context_print: total time = 2444.29 ms / 154 tokens
# batched-bench run CPU
./bin/llama-batched-bench -m ~/models/mamba2-370m-hf/ggml-model-Q4_K_M.gguf -c 2048 -b 2048 -ub 512 -npp 10 -ntg 128 -npl 1 -ngl 0
# llama_perf_context_print: load time = 108.84 ms
# llama_perf_context_print: prompt eval time = 55.39 ms / 26 tokens ( 2.13 ms per token, 469.37 tokens per second)
# llama_perf_context_print: eval time = 1006.83 ms / 128 runs ( 7.87 ms per token, 127.13 tokens per second)
# llama_perf_context_print: total time = 1139.26 ms / 154 tokens
My only thought is that somehow the CPU version requires some kind of "warm up" beyond the basic warm-up call, where it's slower, and that …
@gabe-l-hart That's correct, it's not yet updated. I have not really worked on it since I wrote that, but I'll be working on making/finishing a working version of the CUDA kernel for Mamba-2 in the coming days. It probably won't be the fastest it can be, so performance optimizations will be very welcome. If I don't give a progress update in the next 24 hours, please ping me.
Yes, it could be related to the CPU clock boosting after an increase in workload, and the run in …
Awesome, I super appreciate it. I'll ask around internally and look for folks that might be interested in helping once the initial update is done.
Fascinating, that would make a lot of sense. IIRC, I also saw something like this with …
@compilade @younesbelkada I've now updated …
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
@gabe-l-hart I've implemented a working CUDA kernel for the Mamba-2 ssm scan in f8c7cae, and I've also fixed the Mamba-1 CUDA kernel (which I had previously broken by changing the shapes and reducing the copies with the extra …). I think this is mostly complete (unless there's a remaining problem). I will probably extract 0b6f6be into its own PR, though, since it's not specific to Mamba-2.
Amazing, thank you @compilade! My CUDA box managed to become unreachable yesterday and the sysadmin is out for the week, so I'll test on my end as soon as I can get access to a machine.
Hi @compilade @gabe-l-hart - thank you again for this work. I wanted to follow up on the status of this PR, see what needs to be done before merging it, and ask if I can do anything to help merge it 🙏 thank you again!
I think that is all? (I'll add more if missing)
Thank you @compilade!
test conversion of Mamba-2 models
I was able to successfully convert Codestral 7B and mamba2-2.7b locally and run inference while getting coherent output.
Commands used for conversion (after git cloning the repos locally):
mamba2-2.7b
codestral mamba-7b
Output results:
Command used is below:
Mamba2-2.7B
Codestral-Mamba7b:
--> Codestral 7B tends to repeat a lot of tokens when it's done generating. I found some community issues complaining about the same behaviour outside of llama.cpp (https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/23), so it might be some tokenizer mis-configuration issue which is not related to the pure mamba2 modeling code. Since the official mamba2 seems to give a positive signal with much less repetition, we should probably be good on the conversion side. I can do more extensive tests if you suspect some commits might have created some divergence.
Note I am testing everything on Apple Metal, so I can also take:
test inference of Mamba-2 (and Mamba-1) models
How were you intending to test things? I can take that one and do some tests on Apple Metal.
It's OK to make breaking changes to SSM_SCAN. I'm pretty sure these have very small adoption (if any) at this point.
Thanks for all the testing @younesbelkada! I'm still working on reviving my CUDA box and can test there once it's live. @compilade when you test new model support, do you have a system for comparing activations against the corresponding …
Follow-up from #8519 (comment). This should fix #7727 and fix #8519.
I've implemented the fully recurrent mode of Mamba-2, because it's very similar to Mamba-1, and also because it seems like the most appropriate mode for text generation.
This does not implement the sequentially semistructured matrix mode, because I'm not yet sure how the block decomposition would fit within the batch and ubatch framework of llama.cpp, and how the chunk size should be chosen. If the recurrent mode is faster at single-user auto-regressive text generation, then I'm not sure how to keep the graph node structure constant when using the most appropriate technique for the batch size. If the sequentially semistructured matrix mode is eventually implemented, it should help with prompt processing speed for large prompts.
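To make the "fully recurrent mode" concrete, here is a minimal numpy sketch of the per-token state update it corresponds to. This is illustrative pseudo-reference code, not the ggml implementation; the names, shapes, and the assumption that dt is already activated are mine.

import numpy as np

def mamba2_recurrent_step(h, x, dt, A, B, C, D):
    # h : (n_head, head_dim, d_state)  running state, updated in place
    # x : (n_head, head_dim)           input for this token
    # dt: (n_head,)                    per-head time step (already softplus'd)
    # A : (n_head,)                    per-head scalar decay (negative)
    # B : (n_group, d_state)           input projection for this token
    # C : (n_group, d_state)           output projection for this token
    # D : (n_head,)                    skip connection
    n_head = h.shape[0]
    n_group = B.shape[0]
    heads_per_group = n_head // n_group

    y = np.empty_like(x)
    for hd in range(n_head):
        g = hd // heads_per_group        # B and C are shared within a group
        dA = np.exp(dt[hd] * A[hd])      # one exp per head: A is a scalar here,
                                         # unlike Mamba-1 where it is a matrix
        # state update: h <- h * dA + dt * (x outer B)
        h[hd] = h[hd] * dA + np.outer(dt[hd] * x[hd], B[g])
        # output: contract the state dimension with C, plus the D skip
        y[hd] = h[hd] @ C[g] + D[hd] * x[hd]
    return y

Prompt processing in this mode simply applies this update once per token, which is why the chunked (semistructured) formulation would mainly help with large prompts.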
What to expect
(mostly taken from #8519 (comment))
The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in F32) per sequence (e.g. with -np 1), compared to 38 MiB (also in F32) for Falcon-Mamba-7B (which is based on Mamba-1). But that remains constant whatever the context size. Mamba-2 is easier to implement efficiently, so the bigger state does not really impede inference speed.

However, a big downside right now with recurrent models in llama.cpp is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using llama-server. I think using llama-cli in conversation mode does not have this problem, however (or maybe only the bare interactive mode with --in-prefix and --in-suffix, not sure).

This initial implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba2-130M is similar to or better than Mamba-130M (but still not that fast compared to transformer-based models with an empty context), when both are run on CPU.

The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.
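As a rough sanity check of the 263.5 MiB figure above, here is the arithmetic, assuming the published Mamba-Codestral-7B-v0.1 hyperparameters (these values are my assumption, not something stated in this PR):

# Back-of-the-envelope per-sequence state size in F32.
n_layer, d_inner, d_state, d_conv, n_group = 64, 8192, 128, 4, 8
f32 = 4  # bytes per element

ssm_state  = n_layer * d_inner * d_state * f32                                   # recurrent SSM state
conv_state = n_layer * (d_conv - 1) * (d_inner + 2 * n_group * d_state) * f32    # conv1d shift state

print(f"SSM  state: {ssm_state / 2**20:.1f} MiB")                 # 256.0 MiB
print(f"conv state: {conv_state / 2**20:.1f} MiB")                # 7.5 MiB
print(f"total     : {(ssm_state + conv_state) / 2**20:.1f} MiB")  # 263.5 MiB

The recurrent SSM state dominates; the conv state is comparatively tiny, and neither grows with context length.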
Summary of changes
- Support Mamba2ForCausalLM (including the official Mamba-2 models, and Mamba-Codestral-7B-v0.1).
  - config.json needs to contain "architectures": ["Mamba2ForCausalLM"], for the convert script to properly detect the architecture.
- Mamba-1 is handled as the special case of d_inner (aka 2 * n_embd) heads of size 1.
- Implement the Mamba-2 SSM scan in ggml_ssm_scan in ggml:
  - The same ggml_ssm_scan operator is reused for Mamba-1 and Mamba-2 (ssm_a is broadcast).
  - Fold ssm_d into ggml_ssm_scan.
  - Use GGML_SIMD for the scan.
  - No expf in the state update, unlike with Mamba-1.
.Other
Here's my favorite quote from Section 3.3 of https://arxiv.org/abs/2405.21060:
TODO
- Rebase onto master after merging llama : simplify Mamba with advanced batch splits #8526, and adapt ggml_ssm_scan accordingly.
- Remove the GGML_MUL fast broadcast path because it's not used anymore to mask the states.
- Maybe use a new metadata key instead of {arch}.ssm.time_step_rank for the number of heads of Mamba-2, because it's not really the rank of the time step (well, maybe kind of).
- Keep ssm_d in ggml_ssm_scan?
- Maybe split ggml_ssm_scan to separate the implementations for Mamba-1 and Mamba-2, although they do have a lot in common.