AI Concepts

Quantization

Learn how quantization reduces model memory and serving cost by representing weights or activations with lower precision.

Level: intermediate | 4 min read | Updated 2026-05-26
Topics: Mechanics, Inference, Operations, Tradeoffs
Tags: Quantization, Precision, Model Weights, Activations, Memory Footprint, Accuracy Tradeoffs

After this, you will understand

How quantization works, what tradeoff it introduces, and where it appears in AI systems.

Beginner version

Start with the word in plain English before adding machinery.

Confusion point

Quantization becomes unclear when precision, model weights, and activations are all introduced at once, before the basic idea is in place.

Better mental model

Connect the word to inputs, outputs, model behavior, product boundaries, and evaluation.

Think before reading

Before learning the mechanics, what should a beginner understand about Quantization and Precision?
As you read, separate the vocabulary from the implementation details. The word should feel clear before the system design gets complex.


Concepts Covered

  • Quantization
  • Precision
  • Model weights
  • Activations
  • Memory footprint
  • Inference speed
  • Calibration
  • Quality tradeoffs
  • Weight quantization and activation quantization

Definition

Quantization is the process of representing a model's numeric values with lower precision so that the model uses less memory and can often run faster or more cheaply.

The beginner version:

store or compute model values with fewer bits

Instead of keeping every value in a high-precision format, a quantized model may use formats like 8-bit or 4-bit representations for parts of the model.
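
To make "fewer bits" concrete, here is a minimal sketch of one common scheme: symmetric quantization with a single shared scale. The function names and the example values are assumptions for illustration, not a specific library's API.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integer codes that share one scale."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and keep it inside the signed range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.6, -1.0, 0.2, 0.0]                    # hypothetical weight values
codes, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(codes, scale)
print(codes)     # small integers instead of floats
print(restored)  # close to the originals, but not identical
```

The integer codes are what gets stored. The restored values differ slightly from the originals, and that difference is the approximation error discussed later on this page.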

Why This Concept Exists

Large models are expensive to serve.

They need:

  • memory to store weights
  • memory for runtime state such as activations and caches
  • compute for matrix operations
  • bandwidth to move values through hardware

If a model is too large to fit on available hardware, or too expensive to serve at useful latency, teams look for ways to reduce the serving burden.

Quantization exists because many model values do not need full precision to preserve useful behavior.
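
The memory side of this can be estimated with simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below assumes a hypothetical 7-billion-parameter model and ignores the small overhead of scales and other metadata.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales, metadata, and padding."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


params = 7_000_000_000  # hypothetical model size, for illustration only
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB at 16 bits, 6.5 GiB at 8 bits, and 3.3 GiB at 4 bits. Runtime state such as activations and caches comes on top of that.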

The Beginner Mental Model

A beginner may think:

Smaller numbers mean the same model, just faster.

That is close, but incomplete.

Quantization changes how values are represented. That can reduce memory and improve serving characteristics, but it can also introduce approximation error.

The engineering question is not:

Can we use fewer bits?

It is:

Can we use fewer bits while keeping acceptable quality for this workload?

Precision In Plain English

Precision is about how much detail a numeric representation can carry.

Imagine measuring a temperature:

21.384729 degrees
21.4 degrees
21 degrees

The shorter versions use less detail. They may be good enough for some purposes and too rough for others.

Model quantization makes a similar tradeoff with learned numeric values.
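
The temperature example can be reproduced in a few lines. This is only an analogy using decimal rounding; real quantization formats round to a binary grid, but the loss of detail works the same way.

```python
def rounding_error(value, decimals):
    """Return the rounded value and how far it drifts from the original."""
    rounded = round(value, decimals)
    return rounded, abs(value - rounded)


reading = 21.384729
for decimals in (1, 0):
    rounded, error = rounding_error(reading, decimals)
    print(f"{decimals} decimal place(s): {rounded} (error {error:.6f})")
```

Each step down in detail makes the stored value shorter and the error larger.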

What Can Be Quantized

Different quantization approaches target different parts of model computation.

Common targets include:

  • weights: the learned parameters stored in the model
  • activations: intermediate values produced while the model runs
  • key-value cache: attention key and value tensors kept in memory during inference

Weight quantization is often the easiest first mental model:

same model structure
weights stored in a lower-precision representation

Activation and cache quantization add more runtime complexity because they touch values created during live inference.
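
A minimal sketch of weight-only quantization for a single linear layer follows. It assumes one scale per output row (a common choice, but an assumption here) and keeps activations in floating point. All names and values are illustrative.

```python
def quantize_row(row, bits=8):
    """Quantize one weight row to integer codes with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in row], scale


def linear_float(weight_rows, x):
    """Reference layer: full-precision weights times the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def linear_weight_only(quantized_rows, x):
    """Same layer shape, but weights are stored as integer codes plus a scale."""
    # Activations (x) stay in floating point; only the stored weights changed.
    return [scale * sum(c * xi for c, xi in zip(codes, x))
            for codes, scale in quantized_rows]


weights = [[0.6, -1.0, 0.2], [0.05, 0.3, -0.15]]   # hypothetical 2x3 layer
x = [1.0, 2.0, -1.0]
quantized = [quantize_row(row) for row in weights]
exact = linear_float(weights, x)
approx = linear_weight_only(quantized, x)
print(exact)
print(approx)
```

The model structure is unchanged: the layer has the same shape and produces nearly the same outputs. Quantizing `x` as well would be activation quantization, which needs value ranges that are only known at run time.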

A Small Serving Example

Suppose a model is too large to fit on one GPU in the precision you want.

One option is to buy larger hardware.

Another option is to reduce precision:

full-precision weights -> lower-precision weights

If quality remains acceptable, the model may fit into memory, serve more users per machine, or reduce cost.

If quality drops too much, the cheaper model is not actually useful.

That is why quantization is always tied to evaluation.
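
The decision described above can be sketched as a check with two gates: does the model fit, and does it still pass evaluation? The parameter count, GPU size, headroom, and eval scores below are hypothetical placeholders; real scores have to come from your own evaluation set.

```python
def pick_precision(num_params, gpu_memory_gib, eval_scores, min_score, headroom_gib=2.0):
    """Return the highest precision (in bits) that fits in memory and meets the quality bar."""
    for bits in sorted(eval_scores, reverse=True):
        weights_gib = num_params * bits / 8 / (1024 ** 3)
        # Leave headroom for activations and caches (an assumed flat allowance).
        fits = weights_gib + headroom_gib <= gpu_memory_gib
        good_enough = eval_scores[bits] >= min_score
        if fits and good_enough:
            return bits
    return None  # nothing both fits and passes: change hardware, model, or method


# Hypothetical numbers for illustration only.
scores = {16: 0.90, 8: 0.89, 4: 0.84}
choice = pick_precision(13_000_000_000, 24.0, scores, min_score=0.88)
print(choice)
```

With these made-up numbers, 16-bit weights do not fit, 8-bit weights fit and pass, and 4-bit weights would fit but fall below the quality bar.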

Calibration And Post-Training Quantization

Some quantization methods can be applied after training.

This is often called post-training quantization.

A method may use calibration data to observe value ranges and choose how to map high-precision values into lower-precision ones.

The key idea:

calibration helps choose the lower-precision representation

Poor calibration data can make the quantized model behave worse in production cases.
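
As a rough illustration, the sketch below picks a scale from the largest magnitude seen in calibration samples (one simple strategy among several) and shows what happens when those samples miss the range that production values actually reach. The sample values are invented for the example.

```python
def calibrate_scale(samples, bits=8):
    """Choose a symmetric scale from the largest magnitude in the calibration data."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(s) for s in samples) / qmax


def quantize_dequantize(value, scale, bits=8):
    """Round to the integer grid, clip to the representable range, and map back."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax, min(qmax, round(value / scale)))
    return code * scale


representative = [-3.8, -1.2, 0.4, 2.5, 4.0]   # covers the range seen in production
narrow = [-0.9, -0.2, 0.1, 0.5, 1.0]           # misses the large values

production_value = 3.6
good = quantize_dequantize(production_value, calibrate_scale(representative))
bad = quantize_dequantize(production_value, calibrate_scale(narrow))
print(good, bad)
```

With representative calibration data the value survives almost unchanged. With the narrow data it is clipped to the largest representable value, about 1.0, which is a large error rather than a small rounding one.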

Quantization-Aware Training

Another path is to train or fine-tune with quantization effects in mind.

That can help the model adapt to lower precision, but it is more involved than loading a model with lower-precision weights.

For an engineer, the useful boundary is:

post-training quantization -> cheaper to apply, may lose quality
quantization-aware training -> more work, can preserve quality better

The right choice depends on quality requirements, hardware, latency targets, and team capability.
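
One common way to train with quantization effects in mind is to round values to the low-precision grid in the forward pass while keeping a full-precision copy for updates, often called fake quantization. The toy below is a sketch of that idea on a single weight, with a fixed 4-bit grid and a straight-through gradient as stated assumptions. It is not a real training recipe.

```python
def fake_quant(w, scale=0.25, bits=4):
    """Quantize then dequantize, so the forward pass sees the low-precision value."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax, min(qmax, round(w / scale)))
    return code * scale


def train_latent_weight(x, target, steps=50, lr=0.1):
    """Fit y = fake_quant(w) * x to one example with squared error."""
    w = 0.0  # latent weight kept in full precision during training
    for _ in range(steps):
        prediction = fake_quant(w) * x
        # Straight-through assumption: treat fake_quant as identity when differentiating.
        grad = 2.0 * (prediction - target) * x
        w -= lr * grad
    return w


w = train_latent_weight(x=1.0, target=0.6)
print(w, fake_quant(w))
```

The target 0.6 is not on the 0.25-spaced grid, so the quantized weight ends up on a neighboring grid point (0.5 or 0.75) while the latent weight hovers near the boundary between them. Training sees the rounding error directly, which is the extra work and the potential quality benefit described above.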

Product And Infrastructure Pressure

Quantization matters in real products because serving is not only about model quality.

Teams care about:

  • time to first token
  • tokens per second
  • GPU memory
  • batch size
  • concurrent users
  • cost per request
  • quality under real prompts

A quantized model that is slightly weaker but much cheaper may be acceptable for one workflow.

For another workflow, such as medical, legal, or high-stakes coding assistance, the quality loss may be unacceptable.

Common Confusions

Quantization is not the same as compression in the ordinary file-zip sense.

It changes the numeric representation used for model weights or runtime values.

Quantization is not fine-tuning.

Fine-tuning changes model parameters through training. Quantization changes how values are represented for storage or computation.

Quantization is not automatically lossless.

Lower precision can change behavior. You need evals.

4-bit is not always better than 8-bit.

Fewer bits can save more memory, but may create more approximation error or require more careful methods.
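
A quick way to see the difference is to round-trip the same values through 8-bit and 4-bit grids and compare the worst-case error. This sketch uses simple symmetric quantization and made-up weights. More careful methods can reduce the 4-bit error, but not remove the tradeoff.

```python
def roundtrip_error(values, bits):
    """Largest absolute error after quantizing and dequantizing with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    restored = [round(v / scale) * scale for v in values]
    return max(abs(v - r) for v, r in zip(values, restored))


weights = [0.6, -1.0, 0.2, 0.05, -0.33]   # hypothetical values
err8 = roundtrip_error(weights, 8)
err4 = roundtrip_error(weights, 4)
print(f"8-bit max error: {err8:.4f}")
print(f"4-bit max error: {err4:.4f}")
```

The 4-bit grid has only 15 usable levels across the same range, so each value can land much farther from its original than on the 8-bit grid.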

What This Does Not Mean

Quantization does not make large-model serving free.

It can reduce memory and compute pressure, but the model still needs hardware, batching, cache management, monitoring, and quality evaluation.

It also does not prove the model is production-ready. It only changes the serving tradeoff.
