New research just came out on using a technique inspired by kernel virtual memory and pages to manage the KV cache.
Results? Way faster inference!
https://vllm.ai/
They claim up to 24x the throughput (measured in requests handled per second) of Hugging Face's Transformers library.
How?
Inference is bottlenecked by memory, most notably by the KV cache. According to the authors, the KV cache's defining characteristics are:
- It's very large (see the back-of-the-envelope estimate right after this list)
- It's dynamic: its size depends on sequence length, which varies from request to request. Because of fragmentation and over-reservation, existing systems waste 60-80% of this memory
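To get a feel for why the cache is so large and why it grows with sequence length, here's a rough calculation. The model dimensions below are illustrative assumptions for a 13B-class model, not numbers from the post:

```python
# Back-of-the-envelope KV cache size for one sequence.
# Model dimensions are assumptions for illustration only.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to store keys and values for one sequence."""
    # 2x for keys and values, stored at every layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 13B-class model in fp16.
per_seq = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128,
                         seq_len=2048)
print(f"{per_seq / 1024**3:.2f} GiB per 2048-token sequence")  # ~1.56 GiB
```

The point is that the cache scales linearly with sequence length, and since the final length isn't known in advance, systems that pre-reserve space for the maximum length end up wasting most of it.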
PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory: the cache is split into fixed-size blocks (analogous to pages) that are allocated on demand rather than reserved up front. With this dynamic allocation, only around 4% of memory is wasted, instead of the aforementioned 60-80%.
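Here's a minimal sketch of the paging idea, assuming a toy allocator with a fixed block size and a per-sequence block table. This is not vLLM's actual code, just an illustration of the mechanism:

```python
# Toy sketch of page/block-style KV cache allocation.
# Names and structure are hypothetical, not vLLM's implementation.

BLOCK_SIZE = 16  # tokens per KV block (assumed)

class BlockAllocator:
    """Hands out fixed-size KV blocks from a shared pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Tracks which physical blocks hold this sequence's KV entries."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # Only grab a new block when the last one is full, so at most
        # BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()

# Blocks are allocated on demand as the sequence grows, instead of
# reserving space for the maximum possible length up front.
alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(37):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 37 tokens (16 + 16 + 5)
seq.release()
```

Because blocks don't need to be contiguous, freed blocks from finished sequences can be reused immediately by new ones, which is where the big reduction in fragmentation comes from.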
For further details, refer to their website and GitHub.