### What happened?
Prior to PR #9921 / version 4081, `-ngl 0` Q4_0 llama performance was significantly higher (more than 10x) than afterwards.
(Hardware: Apple MacBook Air M2, 10-core GPU, 24 GB RAM.)
Before the PR:

```shell
make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 60.48 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.89 ± 0.20 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 63.50 ± 2.47 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.93 ± 3.30 |

With `-ngl 99`:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.81 ± 6.53 |
build: ae8de6d (4080)
Versions after the PR (including current):

```shell
make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 4.11 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 1.86 ± 0.01 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 62.81 ± 2.55 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 14.70 ± 1.97 |

With `-ngl 99`:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.25 ± 3.42 |
build: 1607a5e (4081)
The variations in the other results (everything except the `-ngl 0` / Q4_0 case) might be due to the MacBook Air's thermals.
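The "more than 10x" figure follows directly from the pp512 rows above; as a quick sanity check (t/s values copied from the two `-ngl 0` / Q4_0 tables, awk used only for the floating-point division):

```shell
# Slowdown implied by the -ngl 0 / Q4_0 numbers reported above.
awk 'BEGIN {
  printf "pp512: %.1fx slower\n", 60.48 / 4.11   # before vs. after the PR
  printf "tg128: %.1fx slower\n", 14.89 / 1.86
}'
```

So prompt processing dropped by roughly 14.7x and token generation by roughly 8x.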
### Name and Version

```
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1
```
### What operating system are you seeing the problem on?
Mac
### Relevant log output
No response