### What happened?
Prior to PR #9921 / version 4081, `-ngl 0` Q4_0 llama performance was significantly higher (more than 10x) than afterwards.
(Hardware: Apple MacBook Air M2, 10-core GPU, 24 GB RAM.)
Before the PR:

```shell
make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 60.48 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.89 ± 0.20 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 63.50 ± 2.47 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.93 ± 3.30 |

With `-ngl 99`:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.81 ± 6.53 |
build: ae8de6d (4080)
Versions after the PR (including current):

```shell
make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 4.11 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 1.86 ± 0.01 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 62.81 ± 2.55 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.70 ± 1.97 |

With `-ngl 99`:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.25 ± 3.42 |
build: 1607a5e (4081)
The variations in the other results (everything except the `-ngl 0` / Q4_0 case) might be due to the MacBook Air's thermals.
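The "more than 10x" figure follows directly from the pp512 rows above; as a quick sanity check (t/s values copied from the two `-ngl 0` / Q4_0 tables, awk used only for the floating-point division):

```shell
# Slowdown implied by the -ngl 0 / Q4_0 numbers reported above.
awk 'BEGIN {
  printf "pp512: %.1fx slower\n", 60.48 / 4.11   # before vs. after the PR
  printf "tg128: %.1fx slower\n", 14.89 / 1.86
}'
```

So prompt processing dropped by roughly 14.7x and token generation by roughly 8x.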
### Name and Version

```
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1
```
### What operating system are you seeing the problem on?
Mac
### Relevant log output
No response