Google has just published TurboQuant, a compression algorithm that could well change the game for the entire artificial intelligence industry. Presented at the prestigious ICLR 2026 conference on March 24, this research work led by Amir Zandieh and Vahab Mirrokni from Google Research tackles one of the biggest technical problems of current language models: their voracious memory consumption. And the results are impressive.
Why AI consumes so much memory
To understand the importance of TurboQuant, one must first grasp what slows AI down today. When a model like GPT, Gemini or Claude generates text, it does not start from scratch with each word. For every token already processed (a token is roughly a word), it keeps mathematical vectors called keys and values in memory, collectively known as the KV cache. Concretely, each token of your conversation is converted into a series of decimal numbers (for example 1.29, 0.03, -0.76, 0.91...) stored at 16-bit precision.
The problem? This cache grows linearly with the length of the text. For an 8-billion parameter model with a context of 32,000 tokens, the KV cache alone consumes approximately 4.6 GB of VRAM. Often, it is the cache — and not the model itself — that saturates GPU memory. This bottleneck is exactly what TurboQuant comes to solve.
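To make that figure concrete, here is a back-of-the-envelope sketch of the arithmetic. The layer and head shapes below are assumptions typical of an 8-billion-parameter model with grouped-query attention, not numbers taken from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor 2: one key vector and one value vector per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shapes for an 8B-class model (illustrative, not from the paper).
fp16_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                            seq_len=32_000, bytes_per_elem=2)
print(f"FP16 KV cache : {fp16_bytes / 1e9:.1f} GB")   # same order as the ~4.6 GB quoted
print(f"3-bit KV cache: {fp16_bytes * 3 / 16 / 1e9:.1f} GB")
```

Exact numbers depend on the model's layer count, number of key-value heads and head dimension, which is why published figures vary slightly from one configuration to another.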
How TurboQuant works
The algorithm operates in two elegant mathematical steps, each based on solid theoretical foundations.
Step 1: PolarQuant — reorganizing the data
The first step consists of applying a random rotation to the data vectors. By converting classic Cartesian coordinates into polar coordinates (radius + angle), PolarQuant uniformly distributes the energy of each vector across all its components. The result? A predictable statistical distribution that allows optimal quantization via the Lloyd-Max algorithm, without needing to calibrate anything on the target model. This step also eliminates the need to store costly normalization constants in memory.
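A minimal numpy sketch of this first step, under simplifying assumptions: apply a random orthogonal rotation, then read each consecutive coordinate pair as (radius, angle) and quantize the angle on a uniform grid. The real PolarQuant fits Lloyd-Max optimal quantizers to the now-predictable distribution and quantizes the radius as well; both refinements are simplified away here:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(x, rot, angle_bits=6):
    """Rotate, pair up coordinates, quantize each pair's angle.
    A uniform angle grid stands in for the Lloyd-Max quantizer."""
    y = rot @ x
    pairs = y.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    step = 2 * np.pi / 2 ** angle_bits
    code = np.round(theta / step).astype(int) % 2 ** angle_bits
    return r, code, step

def polar_dequantize(r, code, step, rot):
    theta = code * step
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return rot.T @ pairs.reshape(-1)

d = 64
x = rng.normal(size=d)
rot = random_rotation(d)
x_hat = polar_dequantize(*polar_quantize(x, rot), rot)
```

Note that the rotation preserves vector norms exactly, so all the quantization error comes from the angle grid; that is what makes the distortion predictable in advance.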
Step 2: QJL — correcting residual errors
The second step deals with the residual error left by the first compression. The Quantized Johnson-Lindenstrauss (QJL) algorithm projects this error through a mathematical transformation, then keeps only one bit per element: the sign (+1 or -1). This correction makes the estimation of attention scores mathematically unbiased, with almost zero memory overhead.
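The sign trick can be illustrated in a few lines of numpy. The identity behind the unbiasedness, for a Gaussian row s, is E[sign(s·k)(s·q)] = sqrt(2/pi)·⟨q,k⟩/‖k‖, so averaging over many random rows recovers the inner product. The dense Gaussian projection and the dimensions below are illustrative simplifications, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 16, 8192              # original dim, projection dim (illustrative)
S = rng.normal(size=(m, d))  # Gaussian JL projection, shared by all vectors

def qjl_encode(k):
    # Keep only one bit per projected coordinate: the sign.
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_dot(q, sign_bits, k_norm):
    # Unbiased estimate: E[sign(s.k)(s.q)] = sqrt(2/pi) * <q,k> / ||k||
    return np.sqrt(np.pi / 2) * k_norm * np.mean(sign_bits * (S @ q))

q = rng.normal(size=d)
k = q + 0.1 * rng.normal(size=d)   # residual correlated with the query
est = qjl_dot(q, *qjl_encode(k))
true = q @ k
```

Only the sign bits and a single norm per vector are stored, which is why the memory overhead of this correction stays almost zero.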
Numbers that speak for themselves
The performance figures announced by Google Research are remarkable:
- 6x reduction in KV cache memory without measurable precision loss
- Cache compression down to 3 bits per element (vs. 16 bits normally), without any retraining
- Speed gains up to 8x on NVIDIA H100 GPU compared to unquantized 32-bit keys
- Performance virtually identical to original precision on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER and L-Eval benchmarks
Tests were conducted on Gemma and Mistral models, covering a variety of tasks: question answering, code generation and text summarization.
Three advantages that change the game
No training required. Unlike other compression methods, TurboQuant requires no fine-tuning or calibration dataset. You apply it directly to any existing Transformer model, as is.
Model-agnostic. The algorithm works on any Transformer architecture. No need to adapt it depending on whether you use Gemini, Llama, Mistral or another model.
Data-agnostic. TurboQuant is what is called "data-oblivious": its theoretical guarantees hold regardless of the distribution of the data processed. No specific dataset needed to make it work.
Concrete impact for developers
In practice, TurboQuant makes it possible to run significantly larger models on consumer-grade hardware. By combining 4-bit quantized weights with a 4-bit compressed KV cache, configurations previously unthinkable become viable on a simple gaming graphics card.
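A rough memory budget shows why. The shapes here are illustrative assumptions for an 8B-class model with a 32,000-token context, not figures from the article:

```python
# Rough VRAM budget: 4-bit weights + 4-bit KV cache (illustrative shapes).
n_params = 8e9
weights_gb = n_params * 4 / 8 / 1e9                      # 4 bits per weight
kv_fp16_gb = 2 * 32 * 8 * 128 * 32_000 * 2 / 1e9         # assumed 8B-class shapes
kv_4bit_gb = kv_fp16_gb * 4 / 16                         # 4-bit compressed cache
total_gb = weights_gb + kv_4bit_gb
print(f"~{total_gb:.1f} GB total")  # leaves headroom on a 12 GB gaming GPU
```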
The open-source community has wasted no time: several implementations are already available, including versions compatible with HuggingFace, llama.cpp, vLLM and even a standalone Rust library, each usable in just a few lines of code.
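Those community implementations cannot be vouched for here, so as a stand-in, here is a didactic numpy sketch of how the two steps compose into a compressor. All names are hypothetical (`ToyTurboQuant` is not one of the libraries mentioned above), and the coarse uniform quantizer is a simplification of the Lloyd-Max step:

```python
import numpy as np

rng = np.random.default_rng(7)

class ToyTurboQuant:
    """Didactic sketch: random rotation + coarse quantization (step 1),
    plus 1-bit sign codes of the residual (step 2). Not a real library."""

    def __init__(self, d, bits=3, m=4096):
        self.rot, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation
        self.S = rng.normal(size=(m, d))                     # QJL-style projection
        self.half_levels = 2 ** bits / 2

    def compress(self, k):
        y = self.rot @ k
        scale = np.abs(y).max() / self.half_levels
        code = np.round(y / scale)               # coarse integer codes (step 1)
        residual = y - code * scale
        # The sign bits would correct attention-score estimates (step 2);
        # they are stored but not needed for plain reconstruction.
        return code, scale, np.sign(self.S @ residual)

    def decompress(self, code, scale, _sign_bits):
        return self.rot.T @ (code * scale)

d = 64
k = rng.normal(size=d)
tq = ToyTurboQuant(d)
k_hat = tq.decompress(*tq.compress(k))
```

A real integration would plug such a codec into the attention layer's cache read and write path, encoding each key and value vector as it is produced.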
The sweet spot is at 4 bits, where quality remains indistinguishable from FP16 on models of 3 billion parameters and above. At 3 bits, slight degradation may appear on small models (less than 1.6 billion parameters).
The consequences for the AI industry
The potential impact goes far beyond the developer world. If TurboQuant becomes widespread — and all signs point to this being the case — AI model inference costs could drop by 50% or more. Cloud services like Google Cloud, AWS or Azure could serve more requests with the same hardware. Semantic search on billion-scale vector databases would become significantly more efficient.
Unsurprisingly, the announcement has already triggered reactions in financial markets: memory chip manufacturers saw their stock prices fall, with investors anticipating reduced demand for high-performance memory. Some analysts even compare the impact to that of DeepSeek in early 2025.
TurboQuant is the kind of technical advance that does not make mainstream headlines, but that silently transforms an entire industry. By compressing the working memory of AI by a factor of 6 to 8, without quality loss and without retraining, Google has potentially just made artificial intelligence much more accessible — and much less expensive to deploy.