Feature Description
From an idea brought up by @ggerganov in this discussion: #11139 (reply in thread)
While it is NOT a good idea to pack both mmproj + text models (because vision support is still messy atm), we still have some interesting use cases:
- For TTS models, this can be useful because some models may require more than 2 GGUFs to run (for example, Sesame CSM requires backbone, decoder and Mimi models)
- For phi-4-mm model, while the mmproj can't be packed, it is still interesting to pack the LoRA adapters and the text model together
- There are some techniques which use LoRA to recover quality lost to quantization, so it can be useful to pack the LoRA with the model (though I don't know how effective this can be, cc @compilade)
- Some models have more than one input modality (e.g. Phi-4-mm with both audio and vision input), so it could be useful to pack the audio encoder and vision encoder into a single GGUF
Motivation
I am creating this issue to discuss possible implementations.
Possible Implementation
One possible implementation is to add a "namespace" prefix to KV metadata keys and tensor names, plus a "super" key holding the list of namespaces.
For example, in the case of Sesame CSM, given 2 GGUFs (backbone and decoder), the routine to pack them is as follows:
- We create a blank GGUF
- Add metadata `general.namespaces = ["backbone", "decoder"]`
- Copy all metadata + tensors from backbone while adding the `backbone.` prefix to the key name
- Copy all metadata + tensors from decoder while adding the `decoder.` prefix to the key name
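The packing steps above can be sketched roughly as follows. This is only an illustration over a toy in-memory structure: `ToyGguf`, `KvValue` and `pack_namespaces` are hypothetical names made up for this sketch and are not part of the real gguf API, and actual file I/O is left out.

```cpp
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical in-memory stand-in for a GGUF file (NOT the real gguf API).
// A metadata value is either a string or an array of strings.
using KvValue = std::variant<std::string, std::vector<std::string>>;

struct ToyGguf {
    std::map<std::string, KvValue>            kv;      // metadata key -> value
    std::map<std::string, std::vector<float>> tensors; // tensor name  -> data
};

// Pack several named models into one container. Every metadata key and
// tensor name is copied with a "<namespace>." prefix, and the list of
// namespaces is stored under "general.namespaces".
ToyGguf pack_namespaces(const std::vector<std::pair<std::string, ToyGguf>> & parts) {
    ToyGguf out; // start from a blank GGUF
    std::vector<std::string> names;

    for (const auto & [ns, model] : parts) {
        names.push_back(ns);
        for (const auto & [key, value] : model.kv) {
            out.kv[ns + "." + key] = value;
        }
        for (const auto & [name, data] : model.tensors) {
            out.tensors[ns + "." + name] = data;
        }
    }

    out.kv["general.namespaces"] = names; // the "super" key
    return out;
}
```

With a backbone and a decoder as input, the packed container would hold keys such as `backbone.general.architecture` and tensors such as `decoder.output.weight`, so names that are identical in both source files no longer collide.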
These APIs will need to be added to `libllama`:
- `int32_t llama_model_n_namespaces(llama_model * model)`: returns the number of namespaces, 0 meaning no namespace
- `const char ** llama_model_list_namespaces(llama_model * model)`: returns the list of namespaces as strings
- `llama_model * llama_model_get_namespace(int idx)`: returns the sub `llama_model *` object corresponding to a namespace index
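To make the intended call pattern concrete, here is a small mock of how user code might walk the namespaces. Nothing here exists in `libllama`: `toy_model` and the `toy_model_*` functions are hypothetical stand-ins that mirror the proposed signatures, and the sketch assumes the lookup function also receives the parent model, which the proposal above does not spell out. `toy_model_find_namespace` is an extra helper added only for illustration.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model; not part of libllama.
struct toy_model {
    std::string                             name;      // namespace name (empty for the root)
    std::vector<std::unique_ptr<toy_model>> children;  // one sub-model per namespace
    std::vector<const char *>               name_ptrs; // cache backing the C string list
};

// Mirrors the proposed llama_model_n_namespaces: 0 means "no namespace".
int32_t toy_model_n_namespaces(const toy_model * model) {
    return (int32_t) model->children.size();
}

// Mirrors the proposed llama_model_list_namespaces. The returned pointers
// stay valid as long as the model and its children are not modified.
const char ** toy_model_list_namespaces(toy_model * model) {
    model->name_ptrs.clear();
    for (const auto & child : model->children) {
        model->name_ptrs.push_back(child->name.c_str());
    }
    return model->name_ptrs.data();
}

// Mirrors the proposed llama_model_get_namespace, with the parent model
// passed explicitly (an assumption of this sketch).
toy_model * toy_model_get_namespace(toy_model * model, int idx) {
    if (idx < 0 || idx >= toy_model_n_namespaces(model)) {
        return nullptr;
    }
    return model->children[idx].get();
}

// Illustrative helper: look up a sub-model by namespace name.
toy_model * toy_model_find_namespace(toy_model * model, const std::string & name) {
    const char ** names = toy_model_list_namespaces(model);
    for (int32_t i = 0; i < toy_model_n_namespaces(model); ++i) {
        if (name == names[i]) {
            return toy_model_get_namespace(model, i);
        }
    }
    return nullptr;
}
```

In this mock, a model without namespaces simply reports 0, so code written for single-model GGUFs could skip the namespace calls entirely.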
Problems
- For existing models (like TTS), how do we make a smooth transition to the new packed format? Or should we simply accept breaking changes, since not many people are using it anyway?
- How can we design the API so that it requires the least change to user code?