Applied Research: KV Collapse Profiler — Free Compression in Any AI Model
— By Christopher Lynch
Every AI model has hundreds of small decision-makers inside it. Some of them stopped doing useful work — but they're still using memory. We built a tool that finds them in 4 minutes on a laptop. Open source. No GPU required.
How a pragmatic engineer approaches an inference efficiency problem — and what turned up
Bottom Line
Every AI language model — ChatGPT, Llama, Qwen, Mistral — is built from hundreds of small, specialized components called attention heads. Each head has one job: decide which parts of the input text matter for generating the next word. Some heads track grammar. Some track meaning across long passages. Some figure out what "it" refers to in a sentence.
Each head maintains a memory cache (called a KV cache — "key-value" cache) that stores what it's seen so far. This cache is the single biggest consumer of GPU memory when running a model.
Here's the problem: some of these heads have stopped doing useful work. They produce nearly identical outputs regardless of what you feed them. But they're still allocated the same memory as heads doing complex, meaningful work.
We built a profiler that identifies these dead-weight heads in 4 minutes on a MacBook — two lightweight passes over diverse text, no special datasets required. The heads it flags are candidates for extreme compression — in our tests, reducing one head's memory by 2x with zero measurable quality impact, and by 500-680x with quality better than baseline.
The tool is open source on GitHub. It runs on consumer hardware. No GPU. No training data.
Why you should care: If you run AI models, you're paying for GPU memory that some of these heads don't need. This tool tells you which heads those are, for your specific model, in 4 minutes. That's money back in your pocket.
A Quick Primer: How Attention Heads Use Memory
When an AI model generates text, every attention head stores two things for each word it processes:
- A key: think of it as a label that says "here's what I'm about." Other words use this label to decide whether to pay attention to this word.
- A value: the actual content that gets passed forward if another word does pay attention.
Together, all these keys and values form the KV cache. The bigger the conversation, the bigger the cache. The bigger the cache, the more GPU memory you need — and the more you pay.
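The arithmetic behind that memory pressure is simple to sketch. The function below estimates total KV cache size from a model's shape; the example numbers are illustrative, loosely based on a Llama-style 7B configuration, not a measurement from any specific deployment.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: keys + values for every layer and head.

    bytes_per_elem=2 assumes fp16/bf16 storage; the leading 2x counts
    keys and values separately.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative Llama-7B-style shape: 32 layers, 32 KV heads,
# head_dim 128, at an 8k-token context.
size = kv_cache_bytes(32, 32, 128, 8192)
print(f"{size / 2**30:.1f} GiB")  # prints 4.0 GiB
```

At 8k tokens that is already 4 GiB on top of the weights, and it grows linearly with context length, which is why per-head compression opportunities matter.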
Current compression tools treat every head the same: compress all keys to the same precision, regardless of whether a head is doing critical work or sitting idle. That's like giving every employee the same computing budget — from your lead engineer to someone who only checks email.
The Gap Between Research and Reality
Researchers have known for a while that attention heads aren't all equal. DuoAttention (MIT, ICLR 2025) built a framework to identify which heads specialize — but it requires 2,000 optimization steps on expensive A100 GPUs.
Google's TurboQuant (2025) compresses the KV cache with elegant math — shrinking every head to 3-4 bits. It's generating huge momentum, with implementations coming to every major inference framework. But it also applies uniformly — every head gets the same treatment.
The question we explored: Can you figure out which heads are already doing nothing useful — where TurboQuant is spending 3-4 bits on memory that could go to 1-2 bits — in minutes on a MacBook, with no training data?
Yes.
What We Found
We tested three different AI models from different labs, measuring how much each attention head's key outputs varied across different inputs. If a head produces nearly identical keys no matter what text you give it, that head's "routing" decision has effectively flatlined — it's paying attention to the same thing every time.
The key metric is a "collapse score": how similar a head's key outputs are across different inputs. A score of 98% means that head produces nearly identical keys no matter what text you give it; its decision-making has almost completely flatlined. A score of 55% means the head still varies meaningfully but shows significant redundancy.
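One plausible way to compute such a score, sketched below, is the mean pairwise cosine similarity between per-input key summaries for a single head. This is an illustration of the idea, not necessarily the tool's exact definition.

```python
import numpy as np

def collapse_score(keys):
    """keys: array of shape (num_inputs, head_dim), one summary key
    vector per input for a single head.

    Returns mean pairwise cosine similarity: near 1.0 means the head
    emits nearly identical keys regardless of input (collapsed);
    lower values mean the keys still vary with content.
    """
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = normed @ normed.T                      # all pairwise cosines
    n = len(keys)
    off_diag = sims[~np.eye(n, dtype=bool)]       # drop self-similarity
    return float(off_diag.mean())

rng = np.random.default_rng(0)

# A collapsed head: keys are tiny perturbations of one fixed direction
base = rng.normal(size=64)
collapsed = base + 0.01 * rng.normal(size=(16, 64))
print(round(collapse_score(collapsed), 2))   # close to 1.0

# A healthy head: keys vary freely with input
healthy = rng.normal(size=(16, 64))
print(round(collapse_score(healthy), 2))     # near 0
```

A 98% score in this framing says the head's keys occupy essentially one point on the unit sphere across all inputs.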
Three architectures. Different labs. Different training data. Different designs. The pattern held across all of them.
One consistent finding: value outputs never collapse. Only keys. The "content" signal stays alive — it's the "routing" signal (deciding what to pay attention to) that goes dormant. These heads have learned to always look at the same positions regardless of input, making their key memory redundant.
For the most collapsed head in Qwen2.5 (Layer 15, Head 1), we replaced the full key memory with a single averaged representation plus a small residual. Quality impact: ±0.007 — measurement noise. That's 2x compression with zero meaningful quality loss. At maximum compression (500-680x), quality was actually better than baseline — the averaged representation acts as a denoiser, cleaning up what was already noise.
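The "mean plus small residual" idea can be illustrated with a toy quantizer: store one shared mean key vector plus a coarsely quantized residual per position. This is a simplified sketch under assumed parameters, not the exact scheme used in the experiments.

```python
import numpy as np

def compress_keys(keys, residual_bits=4):
    """Replace cached keys with their mean plus a quantized residual.

    keys: (seq_len, head_dim). Per-element storage drops from 16 bits
    (fp16) to roughly residual_bits, plus one shared mean vector.
    """
    mean = keys.mean(axis=0)
    residual = keys - mean
    scale = np.abs(residual).max() or 1.0
    levels = 2 ** (residual_bits - 1) - 1
    quant = np.round(residual / scale * levels).astype(np.int8)
    return mean, quant, scale

def decompress_keys(mean, quant, scale, residual_bits=4):
    levels = 2 ** (residual_bits - 1) - 1
    return mean + quant.astype(np.float32) / levels * scale

rng = np.random.default_rng(1)
# A collapsed head: 128 cached keys barely deviating from one direction
keys = rng.normal(size=64) + 0.05 * rng.normal(size=(128, 64))
mean, quant, scale = compress_keys(keys)
restored = decompress_keys(mean, quant, scale)
err = np.abs(restored - keys).max()
print(f"max reconstruction error: {err:.4f}")
```

For a collapsed head the residuals are tiny, so even a crude 4-bit residual reconstructs the cache almost exactly; for a healthy head the same scheme would lose real routing information.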
Being Honest About Prior Art
The observation that some heads produce low-variation keys is in the research literature. Loki (2024), EigenAttention (2024), and others have documented it. DuoAttention found similar head specialization.
What didn't exist in the literature or any production system: A lightweight, training-free tool that runs on consumer hardware and outputs a ranked map of which heads can be aggressively compressed — in minutes, with no special datasets, no GPU.
DuoAttention requires 2,000 optimization steps on A100 GPUs. Our profiler requires two short passes on a MacBook with fixed diverse text. They answer related but different questions. The tools are complementary.
Methodology Note
This section is for ML practitioners. Skip it if you just want the practical takeaway.
The profiler hooks into the key projection layer and measures pre-RoPE key vectors. In standard transformer attention, rotary position embeddings (RoPE) are applied after key projection, and the rotated keys are what get cached. This means our measurement is a proxy for cached key redundancy, not a direct measurement of it.
Why the signal is still actionable: RoPE rotates vectors by position, which increases geometric spread. If keys are already collapsed before RoPE, the routing signal in that head is architecturally weak — the head has learned to attend to fixed patterns regardless of content. RoPE doesn't create information where none exists. The 98% collapse we measured persists despite the geometric scattering RoPE introduces, which is why the perplexity measurements in Phase 3B show negligible impact from aggressive compression of these heads.
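Capturing pre-RoPE keys comes down to a forward hook on the key projection module. The sketch below uses a stand-in `nn.Linear`; in a HuggingFace Llama-style model the equivalent module would be something like `model.model.layers[i].self_attn.k_proj` (an assumption about that model family's naming, so verify against your model).

```python
import torch
import torch.nn as nn

captured = []

def capture_keys(module, inputs, output):
    # output: (batch, seq_len, num_kv_heads * head_dim), before RoPE
    captured.append(output.detach())

# Stand-in for a transformer block's key projection; hook a real
# model's k_proj module the same way.
k_proj = nn.Linear(32, 32, bias=False)
handle = k_proj.register_forward_hook(capture_keys)

hidden = torch.randn(1, 10, 32)   # (batch, seq, hidden)
_ = k_proj(hidden)
handle.remove()                   # always detach hooks when done

print(captured[0].shape)          # torch.Size([1, 10, 32])
```

Because the hook fires on the projection's output, it sees keys exactly where the profiler measures them: after the learned projection, before positional rotation.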
What You Can Do With This Today
Works right now:
- Run the profiler on any HuggingFace model (Llama, Qwen, Mistral, SmolLM, etc.) to get a map of which heads are candidates for compression
- Use that map to inform per-head calibration in vLLM, which supports per-head quantization
- Run the turboquantplus fork for TurboQuant on Apple Silicon: uniform 3-bit KV cache at full speed
- Understand your model's internal efficiency profile, even if you're using uniform compression today
Not yet possible:
- Automatically feeding this profile to llama.cpp for per-head compression: llama.cpp's KV cache quantization flag currently applies to every head equally, and per-head configuration is in active development
- Fully automated per-head compression in any single production framework
Coming (estimated Q2-Q3 2026):
- Google's official TurboQuant implementation, expected to land in major inference frameworks
- Per-head KV cache configuration in llama.cpp

When those arrive, the collapse profile from this tool is exactly the input they need.
The profiler isn't waiting for the pipeline. It's building the map the pipeline will use.
The Tool
A single file with standard dependencies.
Run it once per model. The JSON output tells you which attention heads are candidates for aggressive compression. 4 minutes on Apple Silicon.
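Consuming the output is straightforward. The field names below are illustrative assumptions, not the tool's actual schema; check the repository's README for the real format.

```python
import json

# Hypothetical profile shaped like the numbers discussed above:
# one entry per head, with its collapse score.
profile = json.loads("""
{
  "heads": [
    {"layer": 15, "head": 1, "collapse": 0.98},
    {"layer": 3,  "head": 7, "collapse": 0.55}
  ]
}
""")

# Rank heads by collapse score: the top entries are the candidates
# for aggressive per-head compression.
ranked = sorted(profile["heads"], key=lambda h: h["collapse"], reverse=True)
for h in ranked:
    print(f"layer {h['layer']:>2} head {h['head']:>2}  collapse {h['collapse']:.2f}")
```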
Open source on GitHub →
What This Is Really About
The point of this work isn't to claim a research discovery. Prior work covers the underlying phenomenon.
The point is to show how we work: we understand the technology at the implementation level, identify where research hasn't made it into production yet, and build the tool that closes the gap. Ship it. Open-source it. Let anyone run it.
We don't just read about AI advances — we run the experiments, measure what's real, and turn findings into tools that solve operational problems.
If you're deploying AI and want a partner who understands the full technology stack and how it best applies to your business — that's the conversation we're built for.
Christopher Lynch is the founder and CEO of Intuitive Context LLC, an AI agent orchestration practice based in Jacksonville, FL. IC builds production AI systems across legal, insurance, travel, real estate, and financial verticals.