Feature Request: Ability to pack multiple GGUFs into single one #13028

Open
@ngxson

Feature Description

From an idea brought up by @ggerganov in this discussion: #11139 (reply in thread)

While it is NOT a good idea to pack both mmproj + text models (because vision support is still messy atm), we still have some interesting use cases:

  • For TTS models, this can be useful because some models may require more than 2 GGUFs to run (for ex. Sesame CSM requires backbone, decoder and Mimi models)
  • For phi-4-mm model, while the mmproj can't be packed, it is still interesting to pack the LoRA adapters and the text model together
  • Some techniques use LoRA to recover quality lost to quantization, so it can be useful to pack the LoRA with the model (though I don't know how effective this can be, cc @compilade )
  • Some models have more than 1 modality (e.g. Phi-4-mm with both audio + vision input), so it could be useful to pack the audio encoder and vision encoder into a single GGUF

Motivation

I'm creating this issue to discuss possible implementations.

Possible Implementation

One implementation could be to add a "namespace" prefix to KV metadata keys and tensor names, then have a "super" key holding the list of namespaces.

For example, in the case of Sesame CSM, given 2 GGUFs (backbone and decoder), the routine to pack them is as follows:

  • We create a blank GGUF
  • Add metadata general.namespaces = ["backbone", "decoder"]
  • Copy all metadata + tensors from backbone while adding backbone. prefix to the key name
  • Copy all metadata + tensors from decoder while adding decoder. prefix to the key name
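
To make the key layout concrete, here is a minimal sketch of the packing steps above. It is only an illustration: plain Python dicts stand in for GGUF files, no real GGUF reading or writing is done, and the function name `pack_namespaces` and the example keys are hypothetical.

```python
# Sketch only: plain dicts stand in for GGUF files; no real GGUF I/O happens here.
# `pack_namespaces` and the example keys below are hypothetical.

def pack_namespaces(parts):
    """Merge {namespace: {"metadata": {...}, "tensors": {...}}} into one packed dict."""
    # Start from a "blank GGUF" that only carries the list of namespaces.
    packed = {
        "metadata": {"general.namespaces": list(parts)},
        "tensors": {},
    }
    for ns, gguf in parts.items():
        for section in ("metadata", "tensors"):
            for key, value in gguf[section].items():
                # Prefix every metadata key and tensor name with "<namespace>."
                packed[section][f"{ns}.{key}"] = value
    return packed


backbone = {
    "metadata": {"general.architecture": "llama"},
    "tensors": {"token_embd.weight": "<backbone tensor data>"},
}
decoder = {
    "metadata": {"general.architecture": "llama"},
    "tensors": {"output.weight": "<decoder tensor data>"},
}

packed = pack_namespaces({"backbone": backbone, "decoder": decoder})
```

With this layout, the two `general.architecture` keys no longer collide, because each one lives under its own prefix.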

These APIs will need to be added into libllama:

  • int32_t llama_model_n_namespaces(llama_model * model): returns the number of namespaces, 0 meaning no namespace
  • const char ** llama_model_list_namespaces(llama_model * model): returns the list of namespaces as strings
  • llama_model * llama_model_get_namespace(int idx): returns the sub llama_model * object corresponding to a namespace index
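
The intended semantics of these three calls can be modelled roughly as below. This is not libllama code: it is a Python sketch over a plain dict standing in for a packed GGUF, and all function names are hypothetical. It also assumes the "get" call receives the packed model in addition to the index, which the signature listed above does not show.

```python
# Sketch only: models the proposed accessors over a dict, not the real libllama API.

def n_namespaces(packed):
    """Number of namespaces; 0 means the file is not packed."""
    return len(packed["metadata"].get("general.namespaces", []))


def list_namespaces(packed):
    """Namespace names, in the order stored in general.namespaces."""
    return list(packed["metadata"].get("general.namespaces", []))


def get_namespace(packed, idx):
    """Return the sub-model for namespace `idx`, with the prefix stripped again."""
    prefix = list_namespaces(packed)[idx] + "."
    return {
        section: {
            key[len(prefix):]: value
            for key, value in packed[section].items()
            if key.startswith(prefix)
        }
        for section in ("metadata", "tensors")
    }


packed = {
    "metadata": {
        "general.namespaces": ["backbone", "decoder"],
        "backbone.general.architecture": "llama",
        "decoder.general.architecture": "llama",
    },
    "tensors": {
        "backbone.token_embd.weight": "<backbone tensor data>",
        "decoder.output.weight": "<decoder tensor data>",
    },
}

decoder = get_namespace(packed, 1)
```

A file without `general.namespaces` reports 0 namespaces, so existing single-model GGUFs would keep working through the normal code path.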

Problems

  1. For existing models (like TTS), how do we make a smooth transition to the new packed format? Or should we simply accept breaking changes, since not many people are using it anyway?
  2. How can we design the API so that it requires the least change to user code?
