Description
The `imatrix` tool, which computes an "importance matrix" that can be used to improve quantization accuracy, currently only works when run on the CPU, which is quite slow. In addition, when `llama.cpp` is built with CUDA support enabled, the call to the data collection function is bypassed, and one gets an empty result, which is inconvenient and leads to confusion.
Also, given the discussions around PRs #4897, #4861, #4856, #4773, where importance matrix capabilities were added to `llama.cpp`, there appears to be a lot of interest in experimenting with different training datasets to create the importance matrix. But experimentation is difficult given the much lower performance of the CPU compared to the GPU.
So, overall, it would be very useful to support importance matrix calculations on faster back-ends (CUDA, Metal, etc.).