Server: reuse cached tokens for shifted prompt #5793

Closed
@ngxson

Description

Motivation

Currently, cached tokens are reused by the server via common_part(new_tokens, cached_tokens)

This works well in the situation where all incoming requests share the same prefix:

cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
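
For reference, common_part amounts to computing the length of the longest common prefix of the two token lists. A minimal sketch of what such a helper could look like (the exact name and signature in the server code may differ):

```cpp
#include <vector>
#include <cstddef>

#include "llama.h"

// Illustrative sketch: length of the longest common prefix between the
// tokens already in the KV cache and the incoming prompt tokens.
static size_t common_part(const std::vector<llama_token> & cached,
                          const std::vector<llama_token> & fresh) {
    size_t i = 0;
    while (i < cached.size() && i < fresh.size() && cached[i] == fresh[i]) {
        i++;
    }
    return i;
}
```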

However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:

cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x

Proposal

My proposal is to detect such cases and use llama_kv_cache_seq_rm + llama_kv_cache_seq_add to shift the tokens in the cache accordingly (a rough sketch follows the diagram below).

cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
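
To make the idea concrete, here is a minimal sketch of how the detection and the shift could be wired together. The helper names (find_shift, apply_shift) and the detection heuristic are assumptions for illustration only, not the actual server code; llama_kv_cache_seq_rm and llama_kv_cache_seq_add are the existing llama.h calls, which operate on position ranges [p0, p1).

```cpp
#include <vector>

#include "llama.h"

// Illustrative only: after the maximal common prefix of length n_prefix,
// look for the position in the cache where the next new tokens reappear.
// Returns the cache offset of the match, or -1 if there is none.
static int find_shift(const std::vector<llama_token> & cached,
                      const std::vector<llama_token> & fresh,
                      int n_prefix, int & n_match) {
    for (int src = n_prefix + 1; src < (int) cached.size(); src++) {
        int len = 0;
        while (src + len < (int) cached.size() &&
               n_prefix + len < (int) fresh.size() &&
               cached[src + len] == fresh[n_prefix + len]) {
            len++;
        }
        if (len > 0) {
            n_match = len;
            return src; // first shifted match; real code may prefer the longest
        }
    }
    return -1;
}

// Realign the cache so the matching span lines up with the new prompt.
static void apply_shift(llama_context * ctx, llama_seq_id seq_id,
                        int n_prefix, int src, int n_match) {
    // drop the cells the new prompt no longer covers: positions [n_prefix, src)
    llama_kv_cache_seq_rm(ctx, seq_id, n_prefix, src);

    // slide the matching span [src, src + n_match) left by (src - n_prefix)
    llama_kv_cache_seq_add(ctx, seq_id, src, src + n_match, n_prefix - src);
}
```

In the example above, n_prefix = 3 (a b c) and the match g h i starts at cache position 6, so the gap [3, 6) is removed and the span [6, 9) is shifted by -3, growing the reused prefix from 3 to 6 tokens.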

I already tested this kind of behavior on my side. It works well, but the catch is that it only works with a single "conversation". Also, I have no idea whether it has negative impacts if done frequently (i.e. fragmenting the cache?) @ggerganov
