Description
Hi all
Hugging Face has a 50GB limit on individual file size, which is a bit annoying. It means it's not possible to upload a q8_0 GGML of a 65B model, or a float16 GGML of a 30B model.
I've had two people ask me to upload q8_0s of my 65B models. One of them asked if I could use another file-sharing site, such as Google Drive. But the other mentioned the possibility of multi-part GGMLs.
I believe llama.cpp used to support multi-part models? It still shows `n_parts = 1` in the header info, which implies it might still support two or more parts as well?
So I'd love to know:
- Does llama.cpp still support multi-part GGMLs?
- And if so, would it be fairly straightforward to modify convert.py to create one?
Here's the method convert.py uses to write the GGML file:
@staticmethod
def write_all(fname_out: Path, params: Params, model: LazyModel, vocab: Vocab) -> None:
    check_vocab_size(params, vocab)
    of = OutputFile(fname_out)
    of.write_file_header(params)
    print("Writing vocab...")
    of.write_vocab(vocab)

    def do_item(item: Tuple[str, LazyTensor]) -> NDArray:
        name, lazy_tensor = item
        return lazy_tensor.load().to_ggml().ndarray

    ndarrays = bounded_parallel_map(do_item, model.items(), concurrency=8)
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
        size = ' x '.join(f"{dim:6d}" for dim in lazy_tensor.shape)
        padi = len(str(len(model)))
        print(f"[{i+1:{padi}d}/{len(model)}] Writing tensor {name:38s} | size {size:16} | type {lazy_tensor.data_type}")
        of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
        ndarray.tofile(of.fout)
    of.fout.close()
Would it just be a case of writing the file header twice, putting the first X layers in the first file and the rest in the second?
What about the vocab - would that go in both files, or only in the first?
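To make the question concrete, here's the kind of change I'm imagining: a hypothetical write_split method that would sit next to write_all in convert.py and reuse its existing helpers. It's completely untested, and the things I'm unsure about (header repeated in each part, vocab only in the first part, the old .bin / .bin.1 naming) are flagged as assumptions in the comments:

    # Completely untested sketch -- write_split is a name I made up, and it
    # reuses the existing helpers from convert.py's OutputFile.
    @staticmethod
    def write_split(fname_out: Path, params: Params, model: LazyModel,
                    vocab: Vocab, n_parts: int = 2) -> None:
        check_vocab_size(params, vocab)
        items = list(model.items())
        # Naive split: first half of the tensors in part 0, the rest in part 1.
        chunk = (len(items) + n_parts - 1) // n_parts
        for part in range(n_parts):
            # ASSUMPTION: parts are named like the old multi-part files,
            # e.g. ggml-model-f16.bin, ggml-model-f16.bin.1, ...
            fname_part = fname_out if part == 0 else Path(f"{fname_out}.{part}")
            of = OutputFile(fname_part)
            of.write_file_header(params)   # ASSUMPTION: header repeated in every part
            if part == 0:
                of.write_vocab(vocab)      # ASSUMPTION: vocab in the first part only
            # Dropped bounded_parallel_map here for clarity.
            for name, lazy_tensor in items[part * chunk:(part + 1) * chunk]:
                ndarray = lazy_tensor.load().to_ggml().ndarray
                of.write_tensor_header(name, lazy_tensor.shape, lazy_tensor.data_type)
                ndarray.tofile(of.fout)
            of.fout.close()

Of course, if the old multi-part format actually split individual tensors across parts rather than putting whole tensors in one part or the other, this would need a different approach entirely, which is part of what I'm hoping someone can clarify.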
Thanks in advance for any info!