Description
While implementing compression for copy/save state, I found a bug which turned out to be reproducible on current `main` (41aee4d). It seems to be model independent, and no parameter other than `-ngl` seems to make a difference either.

The first symptom affects `save-load-state`, `main`, and `server`: when `-ngl` is set to exactly N-1, this is what the generated output looks like:
Hello there!###############################
The second symptom was found by accident while fiddling with `save-load-state` to implement compression. If `-ngl` is N or bigger (all layers offloaded), the problem above seems to disappear; however:

- `save-load-state` fails, because the generated text differs between the two runs;
- after some tokens have been sampled, `llama_copy_state_data` outputs a mostly empty array. I only noticed this because I tried to dump the state after generation and suddenly started getting a 99% compression ratio on that array, since it turned out to be mostly zeroes.

All `-ngl` values between 0 and N-2 work properly.
I have no way of testing on AMD, so I do not know whether this is Nvidia-specific.
As a sanity check, here are results for `-ngl` from 0 to N with the same model and parameters (except `-ngl`):
Edit: Interestingly enough, perplexity looks fine?
`-ngl N-2` (27/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
`-ngl N-1` (28/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
`-ngl N` (29/29)
[1]5.2077,[2]5.1813,[3]5.1687,[4]5.2820,[5]5.2682,[6]5.0756,
Final estimate: PPL = 5.0756 +/- 0.25766