llama : mitigate KV cache fragmentation #3380

Closed
@ggerganov

Description

With the new unified KV cache implementation from #3228 we now support batched decoding.

At runtime, depending on the workload and the length of the decoded sequences, the KV cache can become fragmented. If we think of the cache as an array of cells, each either free or belonging to a certain sequence, we can end up with many short segments of free cells instead of one big segment containing all the free cells. This hinders performance: for each new batch we have to find a free cache segment that can "hold" the entire batch, and when we cannot, we have to split the batch into smaller batches to make it fit.
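
For illustration, here is a minimal sketch of that slot search (not the actual llama.cpp code; the `kv_cell` struct, the `seq_id == -1` convention for free cells, and `find_free_segment` are assumptions made for this example). It shows why fragmentation makes the search fail even when plenty of free cells remain in total:

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t seq_id = -1; // -1 means the cell is free (hypothetical layout)
};

// returns the index of the first contiguous free segment that can hold
// n_tokens, or -1 if no such segment exists
static int find_free_segment(const std::vector<kv_cell> & cells, int n_tokens) {
    int run_start = -1;
    int run_len   = 0;
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i].seq_id < 0) {
            if (run_len == 0) {
                run_start = i;
            }
            if (++run_len >= n_tokens) {
                return run_start;
            }
        } else {
            run_len = 0;
        }
    }
    return -1; // fragmented (or full) - the caller must split the batch
}
```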

One possible mitigation is to "defragment" the cache from time to time (based on some heuristic, or on user request). This can be implemented in a way very similar to the existing KV shift functionality, where we add extra graph nodes when the cache needs to be defragmented.
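
A rough sketch of what such a defragmentation pass could compute, reusing the hypothetical `kv_cell` layout from the sketch above (a real implementation would instead emit graph nodes that copy the corresponding K/V data, analogous to the KV shift nodes):

```cpp
#include <utility>
#include <vector>

// compact all occupied cells towards the start of the cache and return the
// (src, dst) moves; in practice each move would become a K/V copy in the graph
static std::vector<std::pair<int, int>> plan_defrag(std::vector<kv_cell> & cells) {
    std::vector<std::pair<int, int>> moves;
    int dst = 0;
    for (int src = 0; src < (int) cells.size(); ++src) {
        if (cells[src].seq_id < 0) {
            continue; // free cell - nothing to move
        }
        if (src != dst) {
            moves.emplace_back(src, dst);
            cells[dst] = cells[src];
            cells[src] = kv_cell{}; // the old position becomes free
        }
        ++dst;
    }
    return moves; // afterwards, all free cells form one segment at the end
}
```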

Other approaches might be possible. For example, based on the batch size, we could allocate the KV data either at the start or at the end of the cache. This way the larger "prompt" segments would stay next to each other at the start of the cache and the small "text" segments at the end. Not sure if this would lead to less fragmentation, but similar strategies can be explored.
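
As a sketch of that idea (again using the hypothetical `kv_cell` layout and `find_free_segment` from the earlier examples; the size threshold is an arbitrary assumption), the slot search could scan from opposite ends of the cache depending on the batch size:

```cpp
// hypothetical two-ended allocation: large "prompt" batches are placed in the
// first fitting segment from the start, small batches in the last fitting
// segment from the end of the cache
static int find_slot_two_ended(const std::vector<kv_cell> & cells, int n_tokens) {
    const int threshold = 32; // arbitrary cutoff between "large" and "small" batches

    if (n_tokens >= threshold) {
        return find_free_segment(cells, n_tokens); // forward scan from the start
    }

    // backward scan from the end for small batches
    int run_len = 0;
    for (int i = (int) cells.size() - 1; i >= 0; --i) {
        if (cells[i].seq_id < 0) {
            if (++run_len >= n_tokens) {
                return i; // segment [i, i + n_tokens) is free
            }
        } else {
            run_len = 0;
        }
    }
    return -1; // no contiguous segment found - fall back to splitting the batch
}
```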
