Description
Motivation
Currently, cached tokens are reused in the server by calling common_part(new_tokens, cached_tokens).
This works well when all incoming requests share the same prefix:
cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
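For reference, common_part is essentially a longest-common-prefix computation over the cached and incoming token vectors; a minimal sketch of the idea (not the exact server code):

```cpp
#include <vector>
#include "llama.h" // for llama_token

// Length of the longest common prefix of the cached and incoming token vectors.
// This is roughly what the server's common_part() amounts to today.
static size_t common_part(const std::vector<llama_token> & a, const std::vector<llama_token> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```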
However, if the input is shifted (for example, because old messages in the conversation are dropped), the number of reused tokens is reduced:
cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x
Proposal
My proposal is to detect such cases and use llama_kv_cache_seq_rm + llama_kv_cache_seq_add to shift the tokens in the cache accordingly (see the sketch after the diagram below):
cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
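A minimal sketch of what I have in mind, assuming `ctx` is the slot's `llama_context` and `seq_id` its sequence id (the function and variable names here are just illustrative, not existing server code):

```cpp
#include <vector>
#include "llama.h"

// Sketch: reuse a shifted chunk of the KV cache instead of only the common prefix.
// Returns the number of cached tokens that can be reused.
static size_t try_shift_cache(llama_context * ctx, llama_seq_id seq_id,
                              std::vector<llama_token> & cached_tokens,
                              const std::vector<llama_token> & new_tokens) {
    // 1. plain common prefix (what common_part() already gives us)
    size_t n_prefix = 0;
    while (n_prefix < cached_tokens.size() && n_prefix < new_tokens.size() &&
           cached_tokens[n_prefix] == new_tokens[n_prefix]) {
        n_prefix++;
    }

    // 2. look for the continuation of new_tokens further ahead in the cache
    //    (the "shifted input" case: some middle tokens were dropped)
    size_t best_src = n_prefix; // where the matching chunk starts in cached_tokens
    size_t best_len = 0;        // how many tokens match after the shift
    for (size_t src = n_prefix + 1; src < cached_tokens.size(); src++) {
        size_t len = 0;
        while (src + len < cached_tokens.size() && n_prefix + len < new_tokens.size() &&
               cached_tokens[src + len] == new_tokens[n_prefix + len]) {
            len++;
        }
        if (len > best_len) {
            best_len = len;
            best_src = src;
        }
    }

    if (best_len > 0) {
        const llama_pos delta = -(llama_pos)(best_src - n_prefix);

        // drop the cache entries that are no longer part of the prompt ...
        llama_kv_cache_seq_rm (ctx, seq_id, n_prefix, best_src);
        // ... and everything after the reusable chunk ...
        llama_kv_cache_seq_rm (ctx, seq_id, best_src + best_len, -1);
        // ... then shift the surviving chunk left so its positions line up with new_tokens
        llama_kv_cache_seq_add(ctx, seq_id, best_src, best_src + best_len, delta);

        // keep the bookkeeping copy of the cached tokens in sync
        cached_tokens.erase(cached_tokens.begin() + best_src + best_len, cached_tokens.end());
        cached_tokens.erase(cached_tokens.begin() + n_prefix, cached_tokens.begin() + best_src);
    }

    return n_prefix + best_len;
}
```

With the example above, this would call `llama_kv_cache_seq_rm(ctx, seq_id, 3, 6)` to drop `d e f` and `llama_kv_cache_seq_add(ctx, seq_id, 6, 9, -3)` to move `g h i` to positions 3..5, so 6 tokens are reused instead of 3.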
I have already tested this kind of behavior on my side. It works well, but the catch is that it only works with one single "conversation". Also, I have no idea whether it has negative impacts when done frequently (e.g. fragmenting the cache?) @ggerganov