Current eviction zeros out evicted positions and masks them via attention_mask. The model still attends over the full-length tensor (just with zeros masked out). Physically truncating the KV tensors to only the kept tokens would: (1) reduce actual GPU memory, (2) speed up attention computation. Challenge: RoPE position IDs break when tokens are removed — the kept tokens need explicit position_ids remapping so they retain their original positions rather than being renumbered contiguously. An initial test showed catastrophic PPL without the remapping fix.
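A minimal sketch of the truncation-plus-remapping idea, assuming a per-layer KV cache shaped `(batch, heads, seq_len, head_dim)` and a sorted index tensor of tokens to keep (`truncate_kv_cache` and `keep_idx` are hypothetical names, not part of any existing API):

```python
import torch

def truncate_kv_cache(keys, values, keep_idx):
    """Physically drop evicted tokens from one layer's KV cache.

    keys/values: (batch, heads, seq_len, head_dim)
    keep_idx:    (batch, n_keep) original token indices to keep, ascending.

    Returns the compacted keys/values plus position_ids carrying each kept
    token's ORIGINAL position — the remapping fix. Renumbering the kept
    tokens 0..n_keep-1 instead would mismatch the RoPE rotations already
    baked into the cached keys, which is the failure mode that produced
    catastrophic PPL.
    """
    b, h, _, d = keys.shape
    n_keep = keep_idx.shape[1]
    # Broadcast indices to gather along the sequence dimension (dim=2).
    idx = keep_idx[:, None, :, None].expand(b, h, n_keep, d)
    keys_t = keys.gather(2, idx)
    values_t = values.gather(2, idx)
    # Preserve original positions for downstream RoPE consistency.
    position_ids = keep_idx
    return keys_t, values_t, position_ids
```

Unlike the mask-based path, attention now runs over `n_keep` tokens instead of `seq_len`, so both the memory footprint and the attention FLOPs shrink with the cache.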