Closed
Description
Motivation
Be able to make a split where the first shard is very small and contains primarily the metadata, so that it can be downloaded quickly and the download of the other shards can then start without waiting for the first to finish.
Proposition
Add an option to not include tensor data in the first file. Maybe it should be enabled by default.
Should be well tested.
`ggml_alloc` should not be called, as it will complain with `WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_malloc!`
We can add extra metadata in the first file that describes all tensors in the shards, for example.
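To make the proposal concrete, here is a minimal Python sketch of how such a split could be planned. It is not the `gguf-split` implementation: `plan_shards` and `first_shard_metadata` are hypothetical helpers, and the key names `split.count` and `split.tensor_index` are assumptions for illustration (only `split.no` is named in the references below).

```python
from typing import Dict, List, Tuple


def plan_shards(
    tensor_sizes: Dict[str, int], max_shard_bytes: int
) -> Tuple[List[List[str]], Dict[str, int]]:
    """Plan a split whose first shard carries no tensor data.

    Returns (shards, index): shards[i] lists the tensor names stored in
    shard i, and index maps every tensor name to its shard number.
    Hypothetical helper, for illustration only.
    """
    shards: List[List[str]] = [[]]  # shard 0: metadata only, no tensors
    index: Dict[str, int] = {}
    used = 0  # bytes already assigned to the current (last) shard
    for name, size in tensor_sizes.items():
        # Open a new shard for the very first tensor, or when adding this
        # tensor would push a non-empty shard over the size limit.
        if len(shards) == 1 or (shards[-1] and used + size > max_shard_bytes):
            shards.append([])
            used = 0
        shards[-1].append(name)
        used += size
        index[name] = len(shards) - 1
    return shards, index


def first_shard_metadata(
    shards: List[List[str]], index: Dict[str, int]
) -> Dict[str, object]:
    """Build the extra key/value pairs for the first file.

    "split.count" and "split.tensor_index" are assumed key names.
    """
    return {
        "split.no": 0,
        "split.count": len(shards),
        "split.tensor_index": index,  # tensor name -> shard number
    }


if __name__ == "__main__":
    sizes = {"a": 40, "b": 40, "c": 40, "d": 100}
    shards, index = plan_shards(sizes, max_shard_bytes=100)
    print(shards)
    print(first_shard_metadata(shards, index))
```

Because shard 0 never receives a tensor in this sketch, a loader would have nothing to allocate for it, which is the case where the zero-byte allocation warning mentioned above would need to be avoided.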
References
- How to use the `gguf-split` / Model sharding demo #6404
- gguf-split: split and merge gguf per batch of tensors #6135
- llama_model_loader: support multiple split/shard GGUFs #6187
- common: llama_load_model_from_url split support #6192
- split: allow --split-max-size option #6343
- split: allow --split-max-size option #6343 (comment)
- split: allow --split-max-size option #6343 (comment)
- GGUF: missing `split.no` metadata huggingface/huggingface.js#604