Google Research is claiming something that usually gets dismissed as marketing copy: a compression method that cuts memory usage dramatically with zero accuracy loss. TurboQuant, set to appear at ICLR 2026, quantizes the key-value cache of large language models down to 3 bits without retraining, fine-tuning, or measurable degradation across five standard long-context benchmarks.
That's a strong claim. Let's look at what it's actually doing.
How the two-stage pipeline works
TurboQuant combines two component algorithms: PolarQuant (accepted at AISTATS 2026) and QJL (Quantized Johnson-Lindenstrauss, published at AAAI 2025).
The KV cache problem is straightforward. As context windows grow, the high-dimensional vectors stored in the key-value cache consume enormous GPU memory. Quantizing those vectors is the obvious solution, but standard quantization introduces its own cost: you have to store quantization constants (scale and zero-point values) in full precision for every small data block. That overhead typically adds 1-2 bits per number, partially negating your savings.
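To make that overhead concrete, here's a back-of-the-envelope sketch. The 32-value block size and fp16 constants are illustrative assumptions, not TurboQuant's or any specific method's settings:

```python
# Hypothetical illustration of per-block quantization overhead.
# Block size and constant precision are assumptions for the example.

def effective_bits(payload_bits: int, block_size: int,
                   scale_bits: int = 16, zero_point_bits: int = 16) -> float:
    """Bits per value once per-block scale/zero-point storage is counted."""
    overhead = (scale_bits + zero_point_bits) / block_size
    return payload_bits + overhead

# A nominal 3-bit quantizer with an fp16 scale and fp16 zero-point
# per 32-value block actually costs 3 + 32/32 = 4.0 bits per value.
print(effective_bits(3, 32))  # 4.0
```

That extra bit per value is exactly the overhead the next section's normalization-free design is trying to avoid.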
PolarQuant sidesteps this with a geometric insight. Instead of working in Cartesian coordinates, where the data distribution varies and you need per-block normalization, it converts vectors to polar coordinates after applying a random rotation. The rotation makes the angular distribution tightly concentrated and analytically predictable. Because the distribution's shape is known in advance, the quantizer can skip normalization entirely, and no normalization means no quantization constants to store. On its own, PolarQuant compresses the KV cache by over 4.2x, according to the arXiv preprint, while achieving better quality scores than prior state-of-the-art methods.
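A minimal sketch of the rotate-then-quantize-angles idea, under stated assumptions: dimensions are paired off into 2-D points, radii are kept at higher precision, and angles land on a fixed uniform grid, so no per-block constants are stored. This is a simplification for illustration; PolarQuant's actual codebook and radius handling differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(x: np.ndarray, rot: np.ndarray, bits: int = 3):
    """Rotate, split into 2-D pairs, and quantize each pair's angle on a
    fixed uniform grid over [-pi, pi) -- no per-block scale is stored,
    because the post-rotation angle distribution is known in advance."""
    y = rot @ x
    pairs = y.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)          # kept at full precision here
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])  # in [-pi, pi]
    levels = 2 ** bits
    codes = np.round((angles + np.pi) / (2 * np.pi) * levels) % levels
    return radii, codes.astype(np.int8)

def polar_dequantize(radii, codes, rot, bits: int = 3) -> np.ndarray:
    levels = 2 ** bits
    angles = codes / levels * 2 * np.pi - np.pi
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return rot.T @ pairs.reshape(-1)  # rot is orthogonal, so rot.T inverts it

d = 8
rot = random_rotation(d)
x = rng.normal(size=d)
x_hat = polar_dequantize(*polar_quantize(x, rot), rot)
# Reconstruction error is bounded by the angular grid spacing.
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The key property to notice: the quantization grid is fixed ahead of time, so the decoder needs no side information beyond the codes themselves.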
QJL then handles the residual. After PolarQuant encodes the main signal using most of the available bits, roughly 1 bit of budget remains for error correction. QJL applies a Johnson-Lindenstrauss transform to the quantization residual and keeps only a single sign bit per projected coordinate: +1 or -1. The zero-memory-overhead claim is real here: sign bits require no stored scaling constants. A specialized estimator then combines the full-precision query vector with this 1-bit representation to produce an unbiased estimate of the attention score.
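The sign-bit estimator can be sketched as follows. For a Gaussian projection row s, E[sgn(⟨s,k⟩)·⟨s,q⟩] = √(2/π)·⟨q,k⟩/‖k‖, so rescaling by √(π/2)·‖k‖ recovers an unbiased inner-product estimate. The projection dimension and the choice to store ‖k‖ alongside the bits are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_encode(k: np.ndarray, S: np.ndarray):
    """Keep only the sign of each Gaussian projection (1 bit each),
    plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q: np.ndarray, sign_bits: np.ndarray,
                      k_norm: float, S: np.ndarray) -> float:
    """Unbiased estimate of <q, k>: for each Gaussian row s,
    E[sgn(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||,
    so averaging and rescaling by sqrt(pi/2) * ||k|| debiases it."""
    return np.sqrt(np.pi / 2) * k_norm * np.mean(sign_bits * (S @ q))

d, m = 64, 4096          # projection dimension m trades memory for variance
S = rng.normal(size=(m, d))
q, k = rng.normal(size=d), rng.normal(size=d)
bits, k_norm = qjl_encode(k, S)
print(q @ k, qjl_inner_product(q, bits, k_norm, S))
```

Note the asymmetry the article describes: the key is crushed to sign bits, but the query stays at full precision, which is what makes the debiased estimate possible.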
The pipeline: PolarQuant captures the bulk of the signal at high quality, QJL eliminates residual bias using one bit. Together they hit 3-bit total compression.
What the benchmarks actually show
The team evaluated across LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using Gemma and Mistral. Those are standard long-context benchmarks covering question answering, code generation, and summarization. The comparison baseline is KIVI, a prior KV quantization method from ICML 2024.
TurboQuant claims perfect downstream scores on the Needle in a Haystack tests at 6x memory reduction. Across the aggregated benchmarks, both TurboQuant and PolarQuant show near-lossless performance against the KIVI baseline.



