GPU kernels for E8 nearest-point (Triton) #2

@jagmarques

The E8 nearest-point computation currently runs on the CPU and takes about 60 s for 3500 tokens on Mistral-7B, which makes it the latency bottleneck. The algorithm is embarrassingly parallel, since each 8-dimensional group is processed independently of the others, so it maps naturally onto a Triton kernel. Target: under 100 ms for the full compression pipeline on an A100.
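For reference, here is a minimal CPU sketch of the per-group computation that a kernel would need to reproduce. It assumes the textbook Conway-Sloane decoder (E8 as the union of D8 and D8 shifted by 1/2), which may differ from what the repository actually implements. The function names `nearest_d8`, `nearest_e8`, and `quantize_groups` are hypothetical, and ties are broken arbitrarily.

```python
import math

def nearest_d8(x):
    """Nearest point of D8 (integer vectors with an even coordinate sum)."""
    r = [math.floor(v + 0.5) for v in x]  # round each coordinate to nearest int
    if sum(r) % 2 == 0:
        return r
    # Odd sum: re-round the coordinate with the largest rounding error
    # in the opposite direction to restore even parity.
    k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
    r[k] += 1 if x[k] > r[k] else -1
    return r

def nearest_e8(x):
    """Nearest E8 point to one 8-dim vector (assumed Conway-Sloane decoder)."""
    a = [float(v) for v in nearest_d8(x)]                    # candidate in D8
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]  # candidate in D8 + 1/2
    dist_a = sum((p - q) ** 2 for p, q in zip(x, a))
    dist_b = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if dist_a <= dist_b else b

def quantize_groups(vec):
    """Apply nearest_e8 to each consecutive 8-dim group; groups are independent."""
    assert len(vec) % 8 == 0, "length must be a multiple of 8"
    out = []
    for i in range(0, len(vec), 8):
        out.extend(nearest_e8(vec[i:i + 8]))
    return out
```

Each iteration of the loop in `quantize_groups` touches only its own 8 values, so in a kernel one program instance (or one lane) could own one group. The only data-dependent steps are the parity check and the argmax over rounding errors, both of which are fixed-size reductions over 8 elements.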

Labels: enhancement (New feature or request), performance (Performance improvements)
