llama : try to avoid context swap

Currently, when the context becomes full, we pick part of the tokens and recompute the KV cache.

Instead, try to either:
- store the non-RoPEd KV cache, "shift" it when the context is full, and compute RoPE over the entire cache for every new token, taking the current positions into account
- store the RoPEd KV cache (as we do now), "shift" it when the context is full, and apply an extra shift-RoPE on it (assuming RoPE is "additive", i.e. rotating by position p and then by d gives the same result as rotating by p + d)
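
To make the second option concrete, here is a minimal sketch in plain Python (not llama.cpp code) of why the extra shift-RoPE should work. It assumes the common RoPE formulation where consecutive pairs of a key vector are rotated by an angle proportional to the position, with a frequency base of 10000. The names `rope`, `shift_cache`, `n_keep` and `n_discard` are hypothetical and chosen only for illustration. Since each pair undergoes a 2D rotation, rotating by p and then by -n_discard is the same as rotating by p - n_discard, so the cached keys can be moved to their new positions without recomputing them from the tokens.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d).

    Assumption: adjacent-pair RoPE layout with the usual frequency schedule.
    """
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out

def shift_cache(keys, n_keep, n_discard):
    """Hypothetical helper: shift an already RoPEd key cache.

    keys[p] is the key vector that was RoPEd at position p.
    The first n_keep entries stay in place, the next n_discard entries
    are dropped, and the remaining entries are re-rotated by -n_discard
    so that they match their new, smaller positions.
    """
    kept = keys[:n_keep]
    tail = keys[n_keep + n_discard:]
    return kept + [rope(k, -n_discard) for k in tail]

# Build a toy cache of 8 keys, RoPEd at positions 0..7.
raw = [[0.1 * (p + 1), -0.2, 0.3, 0.05 * p] for p in range(8)]
cached = [rope(k, p) for p, k in enumerate(raw)]

# Keep the first 2 positions, discard the next 3, shift the rest down by 3.
shifted = shift_cache(cached, n_keep=2, n_discard=3)

# Reference: apply RoPE from scratch at the new positions.
reference = [rope(k, p) for p, k in enumerate(raw[:2] + raw[5:])]
max_err = max(abs(a - b) for u, v in zip(shifted, reference) for a, b in zip(u, v))
```

In this toy setting `max_err` is at floating-point noise level, which is what the "additive" assumption predicts. Only the keys are rotated here; the sketch assumes values are stored without RoPE and therefore only need to be moved, not re-rotated.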

llama : try to avoid context swap #2060

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

llama : try to avoid context swap #2060

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions