arxiv:2412.10319

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Published on Dec 13, 2024
· Submitted by iofu728 on Dec 16, 2024

Abstract

Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request settings, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.
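To make the two shared context modes concrete, here is a minimal toy sketch in Python. It is not SCBench code: `ToyKVCache`, `prefill`, `multi_turn`, and `multi_request` are hypothetical names, the "cache" is just a list of tokens, and the reading of the modes (one growing session versus independent requests that reuse the same prefilled context) is an assumption based on the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache: one list entry per cached token."""
    tokens: list = field(default_factory=list)

    def extend(self, new_tokens):
        self.tokens.extend(new_tokens)

    def fork(self):
        # Copy the cached prefix so a new request can reuse it independently.
        return ToyKVCache(list(self.tokens))


def prefill(context):
    """Build the cache for the shared context once (whitespace 'tokenizer')."""
    cache = ToyKVCache()
    cache.extend(context.split())
    return cache


def multi_turn(context, queries):
    """One session: every query and answer is appended to the same cache."""
    cache = prefill(context)
    sizes = []
    for query in queries:
        cache.extend(query.split())
        cache.extend(["<answer>"])  # placeholder for generated tokens
        sizes.append(len(cache.tokens))
    return sizes


def multi_request(context, queries):
    """Independent requests: each one starts from a fork of the shared prefix."""
    shared = prefill(context)
    sizes = []
    for query in queries:
        cache = shared.fork()
        cache.extend(query.split())
        cache.extend(["<answer>"])
        sizes.append(len(cache.tokens))
    return sizes
```

In this sketch the cache grows across turns in `multi_turn`, while every request in `multi_request` sees only the shared context plus its own query. In both cases the context is prefilled once, which is the reuse pattern the benchmark is built around.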

Community

🖼️ SCBench: A KV cache-centric Analysis for Long-Context Methods

Previous long-context benchmarks focus only on single-turn evaluation, but most real-world long-context scenarios are multi-turn and rely on KV cache reuse. We propose SCBench, which takes a KV cache-centric perspective to analyze different long-context methods across KV cache generation, compression, retrieval, and loading. SCBench includes 12 tasks covering 4 long-context capabilities, with two shared context modes (Multi-Turn and Multi-Request). Based on that, we obtained the following insights:

  1. Sub-O(n) memory is almost infeasible in multi-turn decoding;
  2. Task performance shows varying decline trends;
  3. All long-context methods experience performance degradation as the budget decreases;
  4. Long-generation scenarios exhibit distribution shift issues.
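One way to build intuition for the first insight is the toy Python sketch below. It is an illustration under loud assumptions, not any method evaluated in the paper: the "cache" is a plain dictionary, and `compress_for_query` and `answer` are hypothetical helpers. The cache is compressed to a fixed budget using only the first query, so a follow-up query about a different part of the context finds nothing left to read.

```python
def compress_for_query(cache, query_key, budget):
    """Keep at most `budget` entries, ranking entries that match the query first.

    Toy stand-in for query-aware KV cache dropping; real methods score
    tokens with attention weights rather than exact key matches.
    """
    ranked = sorted(cache.items(), key=lambda kv: kv[0] != query_key)
    return dict(ranked[:budget])


def answer(cache, query_key):
    """Look up the value for a key; None means the entry was evicted."""
    return cache.get(query_key)


# Shared context: eight key-value facts, all kept in the full O(n) cache.
full_cache = {f"key{i}": f"value{i}" for i in range(8)}

# Fixed-budget cache compressed while answering the first query about key5.
small_cache = compress_for_query(full_cache, "key5", budget=2)

first_turn = answer(small_cache, "key5")    # still answerable
second_turn = answer(small_cache, "key3")   # evicted before it was needed
full_second = answer(full_cache, "key3")    # the full cache still has it
```

Here the compressed cache serves the first turn but has already discarded what the second turn needs, while the full cache answers both.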
