For this week's edition of the Sunday Paper I wanted to cover KV cache compression, a topic that is extremely valuable for LLM deployment. To reduce computation during inference, modern transformer models store the full history of keys and values computed at each layer in the form of the KV cache. This cache becomes a major memory bottleneck at scale, and shrinking it is one of the most active problems in LLM inference. I was inspired to cover the topic by the Multi-head Latent Attention (MLA) module introduced in DeepSeek-V2, a model we covered a few weeks ago. For those of you who haven't read that paper or my previous post, MLA alters the way that keys and values are constructed in attention, which allows for extreme compression of the KV cache even when it is stored at full precision. With that in mind, I wanted to survey some state-of-the-art KV cache compression techniques developed for Llama-style models in order to see the potential limits of this compression.
It is important to understand that there are two main schools of research when it comes to KV cache compression: activation quantization and token eviction. These methods are aptly named, but for the sake of being explicit, activation quantization aims to represent the KV cache using fewer overall bits. This often involves finding codebook centroids that can adequately represent the stored keys and values; from that point the cache simply maintains a low-bit mapping into the codebook. Token eviction, on the other hand, aims to find tokens that are not semantically valuable within the stored context. An easy way to think about this is to ask a simple question: do we really need to store an exact copy of every instance of the word 'the' in memory? The answer to this somewhat rhetorical question is almost certainly no, and that is the observation token eviction exploits. It should also come as no surprise that these methods are orthogonal to each other, meaning a quantization method and an eviction method can be applied at the same time for compounded savings. This is why I was truly excited by the MLA architecture: it proposes a third method that can be composed with the two above. It is quite possible to combine MLA, token eviction, and activation quantization to create a hyper-compressed KV cache, which would go a long way toward removing memory bottlenecks during LLM inference. With all that being said, let's get into the papers.
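To make the distinction concrete, here is a minimal toy sketch of both ideas applied to the same slice of a cache. Nothing here comes from a specific paper, and the eviction policy is a random stand-in; the point is simply that eviction reduces how many tokens are stored while quantization reduces the bits per stored token, so the two savings multiply.

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int = 4):
    """Toy per-tensor uniform quantization: map floats onto 2**bits integer levels."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = torch.round((x - lo) / scale).to(torch.uint8)  # low-bit codes
    return codes, lo, scale                                # codes + metadata to invert the map

def dequantize(codes, lo, scale):
    return codes.float() * scale + lo

# Illustrative shapes: (seq_len, head_dim) for one head of one layer.
keys, values = torch.randn(128, 64), torch.randn(128, 64)

# 1) Token eviction: drop whole rows the (here random) policy marks as unimportant.
keep_mask = torch.rand(128) > 0.5
keys, values = keys[keep_mask], values[keep_mask]

# 2) Activation quantization: store whatever survives in low-bit form.
k_codes, k_lo, k_scale = uniform_quantize(keys)
v_codes, v_lo, v_scale = uniform_quantize(values)
k_approx = dequantize(k_codes, k_lo, k_scale)              # reconstructed at attention time
```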
Sunday: GEAR (paper)
GEAR is a KV quantization method that integrates several orthogonal compression schemes to improve efficiency. The first scheme is outlier removal, where outlier KV activations are masked out and stored separately; this detection is channel-agnostic, filtering out the top and bottom s% of values across the whole KV cache. Uniform quantization is then applied to the filtered KV cache, and the remaining quantization error is approximated with a low-rank term obtained via Singular Value Decomposition (SVD). The authors also introduce a "streaming" strategy for GEAR, where new tokens are stored in a small buffer and the compression is recomputed once the buffer is full, with minimal overhead. In practice the pieces compose well, providing a robust recipe for KV quantization.
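As a rough illustration of how the three pieces fit together, here is a minimal sketch of a GEAR-style decomposition. This is not the authors' implementation: the outlier fraction, bit width, and rank below are placeholders, and the real method also handles the streaming buffer and de-quantization at attention time.

```python
import torch

def gear_style_compress(x: torch.Tensor, bits: int = 4, s: float = 0.01, rank: int = 4):
    """Sketch: sparse outliers + uniform quantization + low-rank residual correction."""
    # 1) Mask out the top and bottom s% of values and keep them in a separate sparse store.
    upper = torch.quantile(x.flatten(), 1 - s)
    lower = torch.quantile(x.flatten(), s)
    outlier_mask = (x > upper) | (x < lower)
    outliers = torch.where(outlier_mask, x, torch.zeros_like(x))
    x_clipped = torch.where(outlier_mask, torch.zeros_like(x), x)

    # 2) Uniform quantization of the filtered (outlier-free) tensor.
    levels = 2 ** bits - 1
    lo, hi = x_clipped.min(), x_clipped.max()
    scale = (hi - lo) / levels
    x_hat = torch.round((x_clipped - lo) / scale) * scale + lo

    # 3) Low-rank (SVD) approximation of the remaining quantization error.
    residual = x_clipped - x_hat
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    L, R = U[:, :rank] * S[:rank], Vh[:rank]     # store two thin matrices instead of the full error

    # Reconstruction = quantized part + low-rank error term + sparse outliers.
    return x_hat + L @ R + outliers

keys = torch.randn(256, 128)                               # toy (tokens, channels) slice of the cache
print((keys - gear_style_compress(keys)).abs().mean())     # small reconstruction error
```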
Monday: 1-bit KV Cache (paper)
In the paper "1-bit KV Cache," the authors explore the independence of KV channels using information theory and entropy approximations. They find that channels are highly coupled, even when grouped sequentially, and this observation leads to a quantization method they call Coupled Quantization. The coupling is particularly pronounced in the early layers and among cached keys. To implement Coupled Quantization, centroids for each channel group are learned from an offline calibration dataset, using either uniform clustering or clustering informed by second-order information (Fisher Information). Although the Fisher Information variant produces higher raw quantization error, it better preserves model performance as measured by perplexity. The method is especially effective when the KV cache is very aggressively quantized (1.25 bits), and there is clear room for further improvement in how the channel groups are chosen.
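Here is a minimal sketch of the core Coupled Quantization idea: instead of quantizing each channel independently, small groups of channels are quantized jointly against a learned codebook, so one index covers the whole group. This shows only the plain k-means ("uniform clustering") variant; the Fisher-Information-weighted clustering and the offline calibration pipeline are omitted, and the group size and bit width are illustrative.

```python
import torch

def kmeans(x: torch.Tensor, k: int, iters: int = 20):
    """Minimal k-means, standing in for the offline centroid learning step."""
    centroids = x[torch.randperm(x.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids

def coupled_quantize(keys: torch.Tensor, group_size: int = 4, bits_per_group: int = 4):
    """Quantize channels jointly in groups: each group of `group_size` channels is
    replaced by a single small centroid index per token."""
    codes, codebooks = [], []
    for start in range(0, keys.shape[1], group_size):
        group = keys[:, start:start + group_size]           # (tokens, group_size)
        centroids = kmeans(group, k=2 ** bits_per_group)    # learned offline on calibration data
        codes.append(torch.cdist(group, centroids).argmin(dim=1))
        codebooks.append(centroids)
    return codes, codebooks

keys = torch.randn(512, 64)                                 # toy calibration slice
codes, codebooks = coupled_quantize(keys)
# Effective rate: 4 bits per group of 4 channels = 1 bit per cached channel entry.
```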
Tuesday: KVQuant (paper)
KVQuant is a method designed for quantization of the KV cache. The authors first show that key activations should be quantized before RoPE is applied, since the rotation mixes channels and makes the outlier structure of keys harder to isolate. Another notable finding, inspired by the "Attention Sinks" paper, is that the first token in a sequence often looks very different from subsequent tokens, so the authors keep it unquantized at the cost of minimal memory overhead. They also find that normalizing the centroids, so that the de-quantized KV cache keeps the same mean and standard deviation as the pre-quantized cache, is crucial when performing extreme compression. KVQuant pairs per-channel key quantization with per-token value quantization, both non-uniform, and achieves significant compression. The method can also be combined with token-dropping approaches for further savings.
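The per-channel versus per-token split is the easiest part to show in code, so here is a sketch of just that axis. It uses plain uniform quantization for brevity; KVQuant itself uses non-uniform, sensitivity-aware codebooks plus the pre-RoPE and centroid-normalization tricks described above.

```python
import torch

def quantize_per_channel(keys: torch.Tensor, bits: int = 4):
    """Per-channel key quantization: every channel (column) gets its own range,
    which isolates the outlier channels observed in pre-RoPE keys."""
    levels = 2 ** bits - 1
    lo = keys.min(dim=0, keepdim=True).values
    hi = keys.max(dim=0, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / levels
    return torch.round((keys - lo) / scale), lo, scale

def quantize_per_token(values: torch.Tensor, bits: int = 4):
    """Per-token value quantization: every token (row) gets its own range."""
    levels = 2 ** bits - 1
    lo = values.min(dim=1, keepdim=True).values
    hi = values.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / levels
    return torch.round((values - lo) / scale), lo, scale

# Keys would be quantized before RoPE; the rotation is re-applied after
# de-quantization when attention is computed.
keys, values = torch.randn(128, 64), torch.randn(128, 64)
k_codes, k_lo, k_scale = quantize_per_channel(keys)
v_codes, v_lo, v_scale = quantize_per_token(values)
```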
An interesting possibility for further performance gains could be incorporating Coupled Quantization (CQ) for key quantization within KVQuant. This would transform the current channel-wise key compression into a multi-channel compression approach, leveraging CQ's data relationship-based compression alongside KVQuant's strong data engineering strategies. Both methods have shown similar performance, so combining them might lead to even better results.
Wednesday: MiniCache (paper)
MiniCache introduces a novel strategy for compressing the KV cache along the depth dimension, motivated by the observation that adjacent layers produce similar KV states for the same tokens. This similarity is most pronounced in the middle-to-deep layers of the Transformer, supporting claims from "The Unreasonable Ineffectiveness of the Deeper Layers" that these layers in modern LLMs can often be pruned. It also hints at the potential for Universal Transformer-like sharing across contiguous layers, resembling grouped-query attention but applied across layers rather than across heads. The authors show that jointly compressing deep layers can reduce the KV cache by 41%. However, this approach might conflict with research on grokking, which is thought to occur in the deep layers, indicating a potential tension between compression strategies and the goal of enhancing model generalization. My last note on this paper is the motivation it gives for KV sharing across layers as a model design decision, which would create something in between a Universal Transformer and a vanilla one, an interesting thought.
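Below is a deliberately simplified sketch of the depth-wise merging idea: when two adjacent layers hold similar KV states for a token, store one shared state and keep only the divergent tokens separately per layer. MiniCache itself interpolates directions spherically and keeps per-layer magnitudes; the cosine threshold and midpoint average here are stand-ins.

```python
import torch
import torch.nn.functional as F

def merge_adjacent_layers(kv_a: torch.Tensor, kv_b: torch.Tensor, threshold: float = 0.9):
    """Toy depth-wise merge: share one interpolated state for tokens whose KV
    states agree across the two layers, retain the rest unmerged."""
    sim = F.cosine_similarity(kv_a, kv_b, dim=-1)        # per-token similarity across layers
    mergeable = sim > threshold
    shared = 0.5 * (kv_a + kv_b)                         # midpoint stand-in for SLERP merging
    return shared[mergeable], kv_a[~mergeable], kv_b[~mergeable], mergeable

layer_a = torch.randn(128, 64)
layer_b = layer_a + 0.05 * torch.randn(128, 64)          # adjacent deep layers tend to be similar
shared, retained_a, retained_b, mask = merge_adjacent_layers(layer_a, layer_b)
print(f"merged {mask.float().mean().item():.0%} of tokens across the two layers")
```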
Thursday: L2 norm for KV Cache Compression (paper)
In this paper, the authors explore token eviction as a method for reducing KV cache size and make a surprising discovery: the L2 norm of a cached key is a strong signal for how important that token is within the attention computation. This leads to a very simple approach in which tokens whose keys have high L2 norms are evicted to maintain a fixed buffer size. Remarkably, this remains effective even when discarding up to 50% of the tokens in the cache. The simplicity of the approach invites enhancements, such as a "recent-cache" buffer that always retains the k most recent tokens and applies the L2-norm filter only to the remaining N-k. While the technique shows promising results on perplexity, NIAH, and passkey retrieval, further investigation on true downstream tasks is needed to fully understand its efficacy.
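Because the method is so simple, here is a sketch of the enhanced variant suggested above: always keep a small window of the most recent tokens, then fill the rest of a fixed budget with the lowest-key-norm tokens from the older context. The budget, window size, and the combination itself are my own assumptions, not something evaluated in the paper.

```python
import torch

def l2_norm_evict(keys: torch.Tensor, values: torch.Tensor, budget: int, recent: int = 32):
    """Keep the `recent` newest tokens, then fill the remaining budget with the
    older tokens whose keys have the smallest L2 norm (high-norm tokens are evicted)."""
    n = keys.shape[0]
    recent_idx = torch.arange(max(0, n - recent), n)
    old_idx = torch.arange(0, max(0, n - recent))

    norms = keys[old_idx].norm(dim=-1)                               # L2 norm of each older key
    keep_old = old_idx[norms.argsort()[: max(0, budget - len(recent_idx))]]

    keep = torch.cat([keep_old, recent_idx]).sort().values           # preserve temporal order
    return keys[keep], values[keep]

keys, values = torch.randn(1024, 64), torch.randn(1024, 64)
k_small, v_small = l2_norm_evict(keys, values, budget=512)           # drop ~50% of the cache
```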
Friday: AdaptiveKV (paper)
AdaptiveKV introduces a set of rules for pruning the KV cache. The method always retains the KV entries for special tokens and important punctuation. It enforces locality by discarding tokens beyond a certain distance, which limits long-range attention and essentially turns the model into a Sliding Window Attention (SWA) model; retaining special tokens mitigates this somewhat, but mainly on perplexity-style evaluations. Finally, the method uses cumulative attention scores to evict tokens that are rarely attended to.
A notable innovation of AdaptiveKV is the use of different policies across attention heads, allowing each head to either maintain a full KV cache or apply some of the pruning strategies mentioned above. This approach leads to moderate throughput increases and memory compression. The most significant contribution of this work may be the insight that each attention head should be treated independently, not only in terms of compression but also regarding the compression ratio.
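To make the per-head idea concrete, here is a toy sketch of how a few such policies could be expressed as keep-masks over the cache, with a different policy assigned to each head. The policy names, window size, and top-k budget are illustrative assumptions, not values from the paper, and the real method presumably chooses each head's policy adaptively rather than from a hard-coded list as here.

```python
import torch

def head_keep_mask(cum_attn: torch.Tensor, special_mask: torch.Tensor,
                   policy: str, window: int = 64, topk: int = 128) -> torch.Tensor:
    """Toy keep-mask for one head. `cum_attn` is the cumulative attention this head
    has paid to each cached token; `special_mask` marks special tokens/punctuation."""
    n = cum_attn.shape[0]
    keep = torch.zeros(n, dtype=torch.bool)
    if policy == "full":
        keep[:] = True                                   # this head keeps its entire cache
    elif policy == "special+recent":
        keep |= special_mask                             # always retain special tokens
        keep[-window:] = True                            # locality: keep a recent window
    elif policy == "frequency":
        keep[cum_attn.topk(topk).indices] = True         # evict rarely-attended tokens
        keep[-window:] = True
    return keep

n_tokens, n_heads = 1024, 8
special = torch.zeros(n_tokens, dtype=torch.bool)
special[0] = True                                        # e.g. the BOS token
policies = ["full", "frequency", "special+recent", "frequency",
            "frequency", "special+recent", "frequency", "special+recent"]
masks = [head_keep_mask(torch.rand(n_tokens), special, p) for p in policies]
ratios = [m.float().mean().item() for m in masks]        # each head ends up with its own compression ratio
```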
Saturday: PyramidInfer (paper)
In this paper, the authors introduce Pivotal Contexts (PvCs), the subset of the context that matters most for the current step of next-token prediction. Initial experiments showed that retaining these PvCs is what preserves inference quality, and that deeper layers can be compressed significantly more than shallow layers by keeping fewer PvCs. Interestingly, the PvCs were consistent across the temporal dimension, indicating that they carry contextual information for many future tokens, a finding reminiscent of Attention Sinks.
These insights led to the development of the PyramidInfer compression method, which utilizes the attention scores of recent contexts to limit the size of the overall KV cache, placing it within a buffer of size p. The buffer size decays for deeper layers, allowing for nearly a 50% reduction in the KV cache while doubling throughput. Although the authors did not evaluate this method on downstream tasks, the approach appears highly promising.
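Here is a rough sketch of the two moving parts as I read them: a per-layer cache budget that shrinks with depth, and PvC selection driven by how much attention the most recent tokens pay to each cached position. The decay schedule, budgets, and window size are all placeholder assumptions.

```python
import torch

def pyramid_budgets(base_budget: int, n_layers: int, decay: float = 0.9):
    """Per-layer cache budgets that shrink with depth, forming the 'pyramid'."""
    return [max(1, int(base_budget * decay ** layer)) for layer in range(n_layers)]

def select_pvcs(attn: torch.Tensor, recent: int, budget: int) -> torch.Tensor:
    """Pick pivotal contexts for one layer/head: the cached positions that the
    `recent` most recent queries attend to most, truncated to this layer's budget.
    attn: (queries, keys) attention weights."""
    importance = attn[-recent:].sum(dim=0)                # total attention from recent tokens
    kept = importance.topk(min(budget, importance.shape[0])).indices
    return kept.sort().values                             # keep positions in temporal order

n_layers, seq_len = 32, 1024
budgets = pyramid_budgets(base_budget=512, n_layers=n_layers)
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)    # toy attention map
kept_deep = select_pvcs(attn, recent=32, budget=budgets[24])   # deeper layer, smaller budget
```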
Conclusion
I found this week truly fascinating, especially considering how recent all of the research is. There is clear room for composing these methodologies, a kind of research that is often under-represented because it rarely produces flashy papers or wins grants. I think such a composition could effectively "solve" KV-cache-related bottlenecks for the foreseeable future, freeing memory for more valuable uses such as inference speed-ups. There are still papers coming out in this field this summer, so I am not ready to write this line of research off yet, but I think we are about to hit diminishing returns (at least for inference).
This post marks the completion of my second month reading a paper every day. I have gained a ton of knowledge through this challenge, and to be honest, I think it is about time for the challenge to end (or at least evolve). Reading all of these papers has left me with many research questions and very little time to tackle them. I am not committing to anything yet, but come the end of July I may cut back on my Sunday Paper content volume in order to investigate some of these questions at whatever small scale I can. I may even post about those investigations here as a supplement to the papers I will continue to read. I already have a small (and so far unsuccessful) investigation into Hyperbolic Attention that I need to finalize and write up.
If you liked this post, please check out my main blog and consider subscribing for free. All of my content is free and will continue to be free. I try to post on my main blog twice a month on Mondays (this may change to Wednesdays going forward), and I will aim to post here every Sunday. I like to talk about cutting-edge AI research and AI philosophy in a manner that is easy to understand for semi-technical audiences.


