Closed
Currently, when the context becomes full, we keep part of the tokens and recompute the KV cache from scratch.
Instead, try one of the following:
- store a non-RoPEd KV cache, "shift" it when the context is full, and compute RoPE over the entire cache for every new token, taking the current positions into account
- store the RoPEd KV cache (as we do now), "shift" it when the context is full, and apply an extra shift-RoPE on it (assuming RoPE is "additive")
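The second option rests on RoPE being additive: since RoPE rotates each 2-component pair by an angle proportional to the position, rotating by position `p` and then by a delta `d` is the same as rotating once by `p + d`. A minimal numeric sketch of this property (all values and positions below are hypothetical, pairs are modeled as complex numbers):

```python
import cmath

def rope(pairs, pos, dim, base=10000.0):
    # Rotate pair i by angle pos * base^(-2*i/dim), as in rotary embeddings.
    return [z * cmath.exp(1j * pos * base ** (-2.0 * i / dim))
            for i, z in enumerate(pairs)]

# Hypothetical key vector as complex pairs.
x = [1 + 2j, 0.5 - 1j, -0.3 + 0.7j, 2 - 0.5j]
dim = 2 * len(x)

# Cache stores the key RoPEd at its original position, e.g. p = 37.
k_cached = rope(x, 37, dim)

# After shifting the context left by 5, the token's new position is 32.
# Applying an extra shift-RoPE with delta = -5 to the cached (RoPEd) key...
k_shifted = rope(k_cached, -5, dim)

# ...matches recomputing RoPE from the non-RoPEd key at the new position.
k_recomputed = rope(x, 32, dim)
assert all(abs(a - b) < 1e-9 for a, b in zip(k_shifted, k_recomputed))
```

So the shift can be applied directly to the RoPEd cache, without keeping the pre-RoPE keys around as option 1 requires.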