Physically remove evicted tokens instead of zeroing + masking #5

@jagmarques

Description

Current eviction zeros out evicted positions and relies on attention_mask, so the model still attends over the full-length tensor (just with the zeroed positions masked out). Physically truncating the KV tensors to only the kept tokens would (1) reduce actual GPU memory use and (2) speed up the attention computation. Challenge: RoPE position IDs break when tokens are removed — the surviving tokens must keep their original position ids via an explicit position_ids remapping, rather than being renumbered contiguously. An initial test without this remapping showed catastrophic PPL.
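A minimal NumPy sketch of the intended mechanics (function name, shapes, and the keep-mask interface are illustrative assumptions, not this repo's actual API). The key point is that the returned position ids are the survivors' original indices, so RoPE rotations computed from them stay consistent after truncation:

```python
import numpy as np

def evict_physically(keys, values, keep_mask):
    """Physically gather kept tokens from a KV cache.

    keys/values: (batch, heads, seq_len, head_dim)  -- hypothetical layout
    keep_mask:   (seq_len,) bool, True for tokens to keep

    Returns truncated keys/values plus the ORIGINAL position ids of the
    kept tokens. Reusing the original ids (e.g. [0, 2, 3]) instead of
    renumbering to [0, 1, 2] is what keeps RoPE consistent.
    """
    kept = np.nonzero(keep_mask)[0]      # original indices of survivors
    new_keys = keys[:, :, kept, :]       # physical truncation: smaller tensors
    new_values = values[:, :, kept, :]
    position_ids = kept                  # remapped position_ids for RoPE
    return new_keys, new_values, position_ids
```

With a mask like `[True, False, True, True]`, the truncated cache has length 3 and `position_ids` is `[0, 2, 3]`, so attention runs over a genuinely smaller tensor while RoPE still sees each token at its original absolute position.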

Metadata

Labels: enhancement (New feature or request), performance (Performance improvements)
