Current eviction zeros out evicted positions and masks them via attention_mask. The model still attends over the full-length tensor (just with zeros masked out). Physically truncating the KV tensors to only the kept tokens would: (1) reduce actual GPU memory, (2) speed up attention computation. Challenge: RoPE position IDs break when tokens are removed — the kept tokens need explicit position_ids remapping so they retain their original positions rather than being renumbered contiguously. An initial test showed catastrophic PPL without the remapping fix.
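A minimal sketch of the truncation-plus-remapping idea, assuming a per-layer KV cache shaped `(batch, heads, seq_len, head_dim)` and a sorted index tensor of tokens to keep (`truncate_kv_cache` and `keep_idx` are hypothetical names, not part of any existing API):

```python
import torch

def truncate_kv_cache(keys, values, keep_idx):
    """Physically drop evicted tokens from one layer's KV cache.

    keys/values: (batch, heads, seq_len, head_dim)
    keep_idx:    (batch, n_keep) original token indices to keep, ascending.

    Returns the compacted keys/values plus position_ids carrying each kept
    token's ORIGINAL position — the remapping fix. Renumbering the kept
    tokens 0..n_keep-1 instead would mismatch the RoPE rotations already
    baked into the cached keys, which is the failure mode that produced
    catastrophic PPL.
    """
    b, h, _, d = keys.shape
    n_keep = keep_idx.shape[1]
    # Broadcast indices to gather along the sequence dimension (dim=2).
    idx = keep_idx[:, None, :, None].expand(b, h, n_keep, d)
    keys_t = keys.gather(2, idx)
    values_t = values.gather(2, idx)
    # Preserve original positions for downstream RoPE consistency.
    position_ids = keep_idx
    return keys_t, values_t, position_ids
```

Unlike the mask-based path, attention now runs over `n_keep` tokens instead of `seq_len`, so both the memory footprint and the attention FLOPs shrink with the cache.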