Description
I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is to load and cache the model in memory during token generation rather than during prompt processing. You can use llama-bench for this.
First drop caches as root:
echo 3 > /proc/sys/vm/drop_caches
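If you are not in a root shell, keep in mind that the redirection itself needs root privileges; a sudo-based variant (a minimal sketch, assuming sudo is set up) is:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
Running sync first flushes dirty pages so drop_caches can discard as much of the page cache as possible.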
and then run llama-bench with only the generation benchmark:
llama-bench --numa distribute -t <number of threads> -m <model> -r 1 -p 0
Then use llama.cpp as usual (don't drop the caches again, so the model stays loaded in memory). Of course, you have to pass the same --numa distribute -t <number of threads> arguments to llama-cli or llama-server.
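For example, the follow-up run could look like this (a sketch only; the thread count and model path are placeholders and should match the llama-bench invocation above):
llama-server --numa distribute -t <number of threads> -m <model>
llama-cli accepts the same --numa and -t arguments if you prefer the CLI.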
On the tested system this increased the token generation rate by 80% (dual Epyc 9175F, 16 x DDR5 6400 MT/s RAM, Llama-3.1-70B-Instruct, f16: tg went from 2.4 t/s to 4.31 t/s).
Let me know if it works for you.