Google has just published TurboQuant, a compression algorithm that could well change the game for the entire artificial intelligence industry. Presented at the prestigious ICLR 2026 conference on March 24, this research work led by Amir Zandieh and Vahab Mirrokni from Google Research tackles one of the biggest technical problems of current language models: their voracious memory consumption. And the results are impressive.
Why AI consumes so much memory
To understand the importance of TurboQuant, one must first grasp what slows AI down today. When a model like GPT, Gemini or Claude generates text, it does not start from scratch with each word. For every token already processed (a token is roughly a word), it keeps mathematical vectors called keys and values in memory, collectively known as the KV cache. Concretely, each token of your conversation is converted into a series of decimal numbers (for example 1.29, 0.03, -0.76, 0.91...) stored at 16-bit precision.
The problem? This cache grows linearly with the length of the text. For an 8-billion parameter model with a context of 32,000 tokens, the KV cache alone consumes approximately 4.6 GB of VRAM. Often, it is the cache — and not the model itself — that saturates GPU memory. This bottleneck is exactly what TurboQuant comes to solve.
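To make that figure concrete, here is a back-of-the-envelope sketch of the arithmetic. The layer and head shapes below are assumptions typical of an 8-billion-parameter model with grouped-query attention, not numbers taken from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor 2: one key vector and one value vector per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shapes for an 8B-class model (illustrative, not from the paper).
fp16_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                            seq_len=32_000, bytes_per_elem=2)
print(f"FP16 KV cache : {fp16_bytes / 1e9:.1f} GB")   # same order as the ~4.6 GB quoted
print(f"3-bit KV cache: {fp16_bytes * 3 / 16 / 1e9:.1f} GB")
```

Exact numbers depend on the model's layer count, number of key-value heads and head dimension, which is why published figures vary slightly from one configuration to another.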
How TurboQuant works
The algorithm operates in two elegant mathematical steps, each based on solid theoretical foundations.
Step 1: PolarQuant — reorganizing the data
The first step consists of applying a random rotation to the data vectors. By converting classic Cartesian coordinates into polar coordinates (radius + angle), PolarQuant uniformly distributes the energy of each vector across all its components. The result? A predictable statistical distribution that allows optimal quantization via the Lloyd-Max algorithm, without needing to calibrate anything on the target model. This step also eliminates the need to store costly normalization constants in memory.
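A minimal numpy sketch of this first step, under simplifying assumptions: apply a random orthogonal rotation, then read each consecutive coordinate pair as (radius, angle) and quantize the angle on a uniform grid. The real PolarQuant fits Lloyd-Max optimal quantizers to the now-predictable distribution and quantizes the radius as well; both refinements are simplified away here:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(x, rot, angle_bits=6):
    """Rotate, pair up coordinates, quantize each pair's angle.
    A uniform angle grid stands in for the Lloyd-Max quantizer."""
    y = rot @ x
    pairs = y.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / 2 ** angle_bits
    code = np.round(theta / step).astype(int) % 2 ** angle_bits
    return r, code, step

def polar_dequantize(r, code, step, rot):
    theta = code * step
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return rot.T @ pairs.reshape(-1)

d = 64
x = rng.normal(size=d)
rot = random_rotation(d)
x_hat = polar_dequantize(*polar_quantize(x, rot), rot)
```

Note that the rotation preserves vector norms exactly, so all the quantization error comes from the angle grid; that is what makes the distortion predictable in advance.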
Step 2: QJL — correcting residual errors
The second step deals with the residual error left by the first compression. The Quantized Johnson-Lindenstrauss (QJL) algorithm projects this error through a mathematical transformation, then keeps only one bit per element: the sign (+1 or -1). This correction makes the estimation of attention scores mathematically unbiased, with almost zero memory overhead.
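The sign trick can be illustrated in a few lines of numpy. The identity behind the unbiasedness, for a Gaussian row s, is E[sign(s·k)(s·q)] = sqrt(2/pi)·⟨q,k⟩/‖k‖, so averaging over many random rows recovers the inner product. The dense Gaussian projection and the dimensions below are illustrative simplifications, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 16, 8192              # original dim, projection dim (illustrative)
S = rng.normal(size=(m, d))  # Gaussian JL projection, shared by all vectors

def qjl_encode(k):
    # Keep only one bit per projected coordinate: the sign.
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_dot(q, sign_bits, k_norm):
    # Unbiased estimate: E[sign(s.k)(s.q)] = sqrt(2/pi) * <q,k> / ||k||
    return np.sqrt(np.pi / 2) * k_norm * np.mean(sign_bits * (S @ q))

q = rng.normal(size=d)
k = q + 0.1 * rng.normal(size=d)   # residual correlated with the query
est = qjl_dot(q, *qjl_encode(k))
true = q @ k
```

Only the sign bits and a single norm per vector are stored, which is why the memory overhead of this correction stays almost zero.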
Numbers that speak for themselves
The performance figures announced by Google Research are remarkable:
- 6x reduction in KV cache memory without measurable precision loss
- Cache compression down to 3 bits per element (vs. 16 bits normally), without any retraining
- Speed gains up to 8x on NVIDIA H100 GPU compared to unquantized 32-bit keys
- Performance virtually identical to original precision on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER and L-Eval benchmarks
Tests were conducted on Gemma and Mistral models, covering a variety of tasks: question answering, code generation and text summarization.
Three advantages that change the game
No training required. Unlike other compression methods, TurboQuant requires no fine-tuning or calibration dataset. You apply it directly to any existing Transformer model, as is.
Model-agnostic. The algorithm works on any Transformer architecture. No need to adapt it depending on whether you use Gemini, Llama, Mistral or another model.
Data-agnostic. TurboQuant is what is called "data-oblivious": its theoretical guarantees hold regardless of the distribution of the data processed. No specific dataset needed to make it work.
Concrete impact for developers
In practice, TurboQuant makes it possible to run significantly larger models on consumer-grade hardware. By combining 4-bit quantized weights with a 4-bit compressed KV cache, configurations previously unthinkable become viable on a simple gaming graphics card.
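A rough memory budget shows why. The shapes here are illustrative assumptions for an 8B-class model with a 32,000-token context, not figures from the article:

```python
# Rough VRAM budget: 4-bit weights + 4-bit KV cache (illustrative shapes).
n_params = 8e9
weights_gb = n_params * 4 / 8 / 1e9                      # 4 bits per weight
kv_fp16_gb = 2 * 32 * 8 * 128 * 32_000 * 2 / 1e9         # assumed 8B-class shapes
kv_4bit_gb = kv_fp16_gb * 4 / 16                         # 4-bit compressed cache
total_gb = weights_gb + kv_4bit_gb
print(f"~{total_gb:.1f} GB total")  # leaves headroom on a 12 GB gaming GPU
```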
The open-source community has wasted no time: several implementations are already available, including versions compatible with HuggingFace, llama.cpp, vLLM and even a standalone Rust library, each usable in just a few lines of code.
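Those community implementations cannot be vouched for here, so as a stand-in, here is a didactic numpy sketch of how the two steps compose into a compressor. All names are hypothetical (`ToyTurboQuant` is not one of the libraries mentioned above), and the coarse uniform quantizer is a simplification of the Lloyd-Max step:

```python
import numpy as np

rng = np.random.default_rng(7)

class ToyTurboQuant:
    """Didactic sketch: random rotation + coarse quantization (step 1),
    plus 1-bit sign codes of the residual (step 2). Not a real library."""

    def __init__(self, d, bits=3, m=4096):
        self.rot, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation
        self.S = rng.normal(size=(m, d))                     # QJL-style projection
        self.half_levels = 2 ** bits / 2

    def compress(self, k):
        y = self.rot @ k
        scale = np.abs(y).max() / self.half_levels
        code = np.round(y / scale)               # coarse integer codes (step 1)
        residual = y - code * scale
        # The sign bits would correct attention-score estimates (step 2);
        # they are stored but not needed for plain reconstruction.
        return code, scale, np.sign(self.S @ residual)

    def decompress(self, code, scale, _sign_bits):
        return self.rot.T @ (code * scale)

d = 64
k = rng.normal(size=d)
tq = ToyTurboQuant(d)
k_hat = tq.decompress(*tq.compress(k))
```

A real integration would plug such a codec into the attention layer's cache read and write path, encoding each key and value vector as it is produced.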
The sweet spot is at 4 bits, where quality remains indistinguishable from FP16 on models of 3 billion parameters and above. At 3 bits, slight degradation may appear on small models (less than 1.6 billion parameters).
The consequences for the AI industry
The potential impact goes far beyond the developer world. If TurboQuant becomes widespread — and all signs point to this being the case — AI model inference costs could drop by 50% or more. Cloud services like Google Cloud, AWS or Azure could serve more requests with the same hardware. Semantic search on billion-scale vector databases would become significantly more efficient.
Unsurprisingly, the announcement has already triggered reactions in financial markets: memory chip manufacturers saw their stock prices fall, with investors anticipating reduced demand for high-performance memory. Some analysts even compare the impact to that of DeepSeek in early 2025.
TurboQuant is the kind of technical advance that does not make mainstream headlines, but that silently transforms an entire industry. By compressing the working memory of AI by a factor of 6 to 8, without quality loss and without retraining, Google has potentially just made artificial intelligence much more accessible — and much less expensive to deploy.