llama : mitigate KV cache fragmentation #3380

Closed
@ggerganov

Description

With the new unified KV cache implementation from #3228 we now support batched decoding.

At runtime, depending on the workload and the length of the decoded sequences, the KV cache can become fragmented. If we think of the cache as an array of cells, each either free or belonging to a certain sequence, we can end up with many short segments of free cells instead of one big segment containing all the free cells. This hinders performance: for each new batch we have to find a free cache segment that can "hold" the entire batch, and when we cannot, we have to split the batch into smaller batches to make it fit.
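
For illustration, here is a minimal sketch of that slot search (not the actual llama.cpp code; the `kv_cell` struct, the `seq_id == -1` convention for free cells, and `find_free_segment` are assumptions made for this example). It shows why fragmentation makes the search fail even when plenty of free cells remain in total:

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t seq_id = -1; // -1 means the cell is free (hypothetical layout)
};

// returns the index of the first contiguous free segment that can hold
// n_tokens, or -1 if no such segment exists
static int find_free_segment(const std::vector<kv_cell> & cells, int n_tokens) {
    int run_start = -1;
    int run_len   = 0;
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i].seq_id < 0) {
            if (run_len == 0) {
                run_start = i;
            }
            if (++run_len >= n_tokens) {
                return run_start;
            }
        } else {
            run_len = 0;
        }
    }
    return -1; // fragmented (or full) - the caller must split the batch
}
```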

One possible mitigation is to "defragment" the cache from time to time (based on some heuristic, or on user request). This can be implemented in a way very similar to the existing KV shift functionality, where we add extra graph nodes when the cache needs to be defragmented.
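
A rough sketch of what such a defragmentation pass could compute, reusing the hypothetical `kv_cell` layout from the sketch above (a real implementation would instead emit graph nodes that copy the corresponding K/V data, analogous to the KV shift nodes):

```cpp
#include <utility>
#include <vector>

// compact all occupied cells towards the start of the cache and return the
// (src, dst) moves; in practice each move would become a K/V copy in the graph
static std::vector<std::pair<int, int>> plan_defrag(std::vector<kv_cell> & cells) {
    std::vector<std::pair<int, int>> moves;
    int dst = 0;
    for (int src = 0; src < (int) cells.size(); ++src) {
        if (cells[src].seq_id < 0) {
            continue; // free cell - nothing to move
        }
        if (src != dst) {
            moves.emplace_back(src, dst);
            cells[dst] = cells[src];
            cells[src] = kv_cell{}; // the old position becomes free
        }
        ++dst;
    }
    return moves; // afterwards, all free cells form one segment at the end
}
```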

Other approaches might be possible. For example, based on the batch size, we could allocate the KV data either at the start or at the end of the cache. This way the larger "prompt" segments would stay next to each other at the start of the cache and the small "text" segments at the end. Not sure if this would lead to less fragmentation, but similar strategies can be explored.
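
As a sketch of that idea (again using the hypothetical `kv_cell` layout and `find_free_segment` from the earlier examples; the size threshold is an arbitrary assumption), the slot search could scan from opposite ends of the cache depending on the batch size:

```cpp
// hypothetical two-ended allocation: large "prompt" batches are placed in the
// first fitting segment from the start, small batches in the last fitting
// segment from the end of the cache
static int find_slot_two_ended(const std::vector<kv_cell> & cells, int n_tokens) {
    const int threshold = 32; // arbitrary cutoff between "large" and "small" batches

    if (n_tokens >= threshold) {
        return find_free_segment(cells, n_tokens); // forward scan from the start
    }

    // backward scan from the end for small batches
    int run_len = 0;
    for (int i = (int) cells.size() - 1; i >= 0; --i) {
        if (cells[i].seq_id < 0) {
            if (++run_len >= n_tokens) {
                return i; // segment [i, i + n_tokens) is free
            }
        } else {
            run_len = 0;
        }
    }
    return -1; // no contiguous segment found - fall back to splitting the batch
}
```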
