Description
I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is to load and cache the model in memory during token generation rather than during prompt processing. You can use llama-bench for this.
First drop caches as root:
echo 3 > /proc/sys/vm/drop_caches
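If you are not in a root shell, keep in mind that the redirection itself needs root privileges; a sudo-based variant (a minimal sketch, assuming sudo is set up) is:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
Running sync first flushes dirty pages so drop_caches can discard as much of the page cache as possible.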
and then run llama-bench with only the generation benchmark:
llama-bench --numa distribute -t <number of threads> -m <model> -r 1 -p 0
Then use llama.cpp as usual (don't drop the caches again, so the model stays loaded in memory). Of course, you have to pass the same --numa distribute -t <number of threads> arguments to llama-cli or llama-server.
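For example, the follow-up run could look like this (a sketch only; the thread count and model path are placeholders and should match the llama-bench invocation above):
llama-server --numa distribute -t <number of threads> -m <model>
llama-cli accepts the same --numa and -t arguments if you prefer the CLI.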
On the tested system this increased the token generation rate by 80% (dual Epyc 9175F, 16 x DDR5 6400 MT/s RAM, Llama-3.1-70B-Instruct, f16: tg went from 2.4 t/s to 4.31 t/s).
Let me know if it works for you.