Distillation

Understand how knowledge distillation trains a smaller or cheaper model to imitate useful behavior from a larger teacher model.

Intermediate | 4 min read | Updated 2026-05-26 | Mechanics, Modeling, Inference, Tradeoffs

After this, you will understand

Which mechanism does the work in distillation, what tradeoff it introduces, and where it appears in AI systems.

Beginner version

Start with the word in plain English before adding machinery.

Confusion point

The idea becomes unclear when terms like knowledge distillation, teacher model, and soft targets are introduced before the basic goal is clear.

Better mental model

Connect the word to inputs, outputs, model behavior, product boundaries, and evaluation.

Think before reading

Before learning the mechanics, what should a beginner understand about Distillation and Knowledge Distillation?

As you read, separate the vocabulary from the implementation details. The word should feel clear before the system design gets complex.

Concepts Covered

Distillation
Knowledge distillation
Teacher model
Student model
Soft targets
Model compression
Deployment tradeoffs
Quality evaluation

Definition

Distillation is a training approach where a smaller or cheaper model learns from the behavior of a larger, stronger, or more expensive model.

The plain-English version:

teacher model shows useful behavior
student model trains to imitate enough of it

The goal is often to make a deployable model that keeps much of the teacher's usefulness while being cheaper, faster, smaller, or easier to serve.

Why This Concept Exists

The best model for quality may not be the best model for production.

A large model can be:

expensive to run
slow for a latency-sensitive product
too large for edge devices
hard to batch efficiently
unnecessary for simpler tasks

Distillation exists because teams often want some behavior from a powerful model in a smaller serving package.

The student does not become the same model. It learns from examples of the teacher's behavior. The student may imitate useful output patterns, but it has its own architecture, capacity, limits, and failure modes.

A more accurate framing:

distillation transfers behavior signals, not the teacher's entire mind

Teacher And Student

The teacher is the model that provides the learning signal.

The student is the model being trained.

In a product setting:

large teacher -> high quality, expensive
smaller student -> cheaper, faster, maybe weaker

The teacher may produce labels, rankings, probabilities, explanations, traces, or output examples depending on the distillation setup.

The student trains on those signals and tries to reproduce the behavior that matters.

Soft Targets

In ordinary supervised training, a target may be a hard label:

support ticket -> billing

Distillation can use richer signals from the teacher.

For example, the teacher may indicate:

billing: 0.72
account access: 0.18
technical issue: 0.08
other: 0.02

Those probabilities carry more information than only saying billing.

They reveal what the teacher found plausible and what it rejected. That extra structure shows the student how the classes relate to each other, which can help it learn a smoother decision boundary than hard labels alone.
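
To make the difference concrete, here is a minimal sketch that scores one student prediction against both the hard label and the teacher probabilities listed above. The student logits and the helper names (softmax, cross_entropy) are assumptions made for illustration, not part of any particular library. The temperature parameter is a common way to flatten a distribution so the smaller probabilities carry more signal.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of a prediction against a hard or soft target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

labels = ["billing", "account access", "technical issue", "other"]
hard_target = [1.0, 0.0, 0.0, 0.0]       # ordinary supervised label
soft_target = [0.72, 0.18, 0.08, 0.02]   # teacher probabilities from the text

# Hypothetical student logits for the same support ticket.
student_logits = [2.0, 0.5, 0.1, -1.0]
student_probs = softmax(student_logits)

# The hard loss only looks at the "billing" probability.
hard_loss = cross_entropy(hard_target, student_probs)
# The soft loss also penalizes mismatches on the other three classes.
soft_loss = cross_entropy(soft_target, student_probs)

# A higher temperature spreads probability mass across classes.
flattened = softmax(student_logits, temperature=4.0)
```

With the hard target, the student gets no feedback about how it ranks the three non-billing classes. With the soft target, every class contributes to the loss.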

A Small Product Example

Imagine a company has a high-quality support classifier built with a large model.

The product needs to classify millions of short support messages cheaply.

The team might:

run the large teacher on many representative messages
store the teacher's outputs
train a smaller student on those outputs
evaluate the student against real support outcomes
deploy the student for the high-volume path
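
The first three steps can be sketched in a few lines. This is a toy under stated assumptions: teacher_predict is a hand-written stand-in for the large model, the student is a small logistic regression over a six-word vocabulary, and the messages are made up. Evaluation against real outcomes and deployment need real data and are left out.

```python
import math

VOCAB = ["invoice", "refund", "charge", "password", "login", "locked"]

def teacher_predict(message):
    """Hypothetical stand-in for the large teacher: returns P(billing)."""
    words = message.lower().split()
    score = 2.0 * sum(w in {"invoice", "refund", "charge"} for w in words)
    score -= 2.0 * sum(w in {"password", "login", "locked"} for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def featurize(message):
    """Binary bag-of-words features over the small vocabulary."""
    words = message.lower().split()
    return [float(v in words) for v in VOCAB]

def student_predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_student(examples, epochs=5000, lr=1.0):
    """Fit logistic regression to the teacher's soft probabilities with SGD."""
    weights, bias = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for features, teacher_prob in examples:
            pred = student_predict(weights, bias, features)
            error = pred - teacher_prob  # gradient of cross-entropy w.r.t. the logit
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Step 1: run the teacher on representative messages.
messages = [
    "refund for duplicate charge",
    "invoice is wrong",
    "password reset not working",
    "account locked after login",
    "question about my invoice and login",
]
# Step 2: store the teacher's outputs next to the inputs.
stored = [(featurize(m), teacher_predict(m)) for m in messages]
# Step 3: train the smaller student on those stored outputs.
weights, bias = train_student(stored)
```

Note that the student never sees a human label here. Everything it learns comes from the stored teacher probabilities, which is why the later evaluation against real outcomes matters.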

The teacher may still be used for harder cases, audits, or new data generation.
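
One way to keep the teacher in the loop for harder cases is a confidence-based fallback. The sketch below assumes both models return a (label, confidence) pair. The route function, the 0.8 threshold, and the two toy models are hypothetical.

```python
def route(message, student, teacher, threshold=0.8):
    """Use the student when it is confident; otherwise fall back to the teacher."""
    label, confidence = student(message)
    if confidence >= threshold:
        return label, "student"
    teacher_label, _ = teacher(message)
    return teacher_label, "teacher"

def toy_student(message):
    # Hypothetical stand-in: only confident about refund requests.
    if "refund" in message.lower():
        return "billing", 0.95
    return "other", 0.55

def toy_teacher(message):
    # Hypothetical stand-in for the expensive model.
    if "locked" in message.lower():
        return "account access", 0.90
    return "billing", 0.90

easy = route("Refund please", toy_student, toy_teacher)
hard = route("I am locked out", toy_student, toy_teacher)
```

This only works if the student's confidence is trustworthy. A distilled student can be overconfident, so the threshold should be set from evaluation data rather than guessed.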

Distillation vs Fine-Tuning

Fine-tuning adapts an already-trained model through additional training on task-specific data.

Distillation uses another model's behavior as part of the training signal.

They can overlap.

A team may fine-tune a student using teacher-generated outputs. The distinction is the source of the supervision:

fine-tuning -> train on task data
distillation -> train from teacher behavior
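
The overlap can be written as a single loss that mixes both sources of supervision. The sketch below shows one common formulation, a weighted sum of cross-entropy on the task label and KL divergence from the teacher's distribution. The function names and the alpha weight are assumptions for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(student_probs, teacher_probs, true_index, alpha=0.5, eps=1e-12):
    """alpha weights the task label; (1 - alpha) weights the teacher signal."""
    task_loss = -math.log(student_probs[true_index] + eps)      # fine-tuning term
    distill_loss = kl_divergence(teacher_probs, student_probs)  # distillation term
    return alpha * task_loss + (1.0 - alpha) * distill_loss

teacher_probs = [0.72, 0.18, 0.08, 0.02]   # teacher probabilities from the earlier example
student_probs = [0.60, 0.25, 0.10, 0.05]   # hypothetical student output

only_task = combined_loss(student_probs, teacher_probs, true_index=0, alpha=1.0)
only_teacher = combined_loss(student_probs, teacher_probs, true_index=0, alpha=0.0)
mixed = combined_loss(student_probs, teacher_probs, true_index=0, alpha=0.5)
```

Setting alpha to 1.0 reduces this to training on task labels only, and setting it to 0.0 reduces it to pure imitation of the teacher.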

Why Distillation Needs Evals

Distillation can preserve useful behavior, but it can also preserve mistakes.

The student may:

imitate teacher bias
lose rare-case behavior
become overconfident
handle simple cases well and hard cases poorly
fail on production data that was missing from the distillation set

This is why distillation should be tied to task evals, safety checks, and production monitoring.

The question is not:

Did the student imitate the teacher?

It is:

Does the student meet the product contract at the cost and latency we need?
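
A small check makes the two questions separable. The sketch below reports teacher agreement and accuracy against real outcomes as different numbers, then tests the student against an accuracy floor and a latency budget. The function name, the thresholds, and the sample data are all hypothetical.

```python
import math

def evaluate_student(student_preds, teacher_preds, true_labels, latencies_ms,
                     min_accuracy, max_p95_latency_ms):
    """Compare the student to real outcomes and to a latency budget."""
    n = len(true_labels)
    teacher_agreement = sum(s == t for s, t in zip(student_preds, teacher_preds)) / n
    accuracy = sum(s == y for s, y in zip(student_preds, true_labels)) / n
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile of observed latency.
    p95_latency = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return {
        "teacher_agreement": teacher_agreement,
        "accuracy": accuracy,
        "p95_latency_ms": p95_latency,
        "meets_contract": accuracy >= min_accuracy and p95_latency <= max_p95_latency_ms,
    }

# Made-up evaluation set: the teacher itself is wrong on the fourth message.
true_labels   = ["billing", "billing", "access", "other",   "access"]
teacher_preds = ["billing", "billing", "access", "billing", "access"]
student_preds = ["billing", "billing", "access", "billing", "other"]
latencies_ms  = [12, 15, 11, 14, 40]

report = evaluate_student(student_preds, teacher_preds, true_labels, latencies_ms,
                          min_accuracy=0.9, max_p95_latency_ms=50)
```

In this made-up data the student agrees with the teacher on 4 of 5 messages but is correct on only 3 of 5, because it copied one teacher mistake and added one of its own. It stays inside the latency budget and still fails the contract.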

Product And Infrastructure Pressure

Distillation is useful when a product has a clear serving reason:

lower cost per request
lower latency
offline or edge deployment
smaller specialized model for a narrow task
faster fallback path
cheaper batch processing

It is less useful when the product truly needs the full capability of the larger model or when the task keeps changing faster than the student can be refreshed.
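
The cost side of that decision is simple arithmetic. The sketch below assumes a setup where every request goes to the student and a fraction also falls back to the teacher. The function names are hypothetical and the costs are placeholder units, not measurements.

```python
def blended_cost_per_request(student_cost, teacher_cost, fallback_rate):
    """Average cost when all requests hit the student and some also hit the teacher."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return student_cost + fallback_rate * teacher_cost

def break_even_fallback_rate(student_cost, teacher_cost):
    """Fallback rate above which the student-plus-teacher path costs more than the teacher alone."""
    return 1.0 - student_cost / teacher_cost

# Placeholder units: the teacher costs 10x the student per request.
blended = blended_cost_per_request(student_cost=1.0, teacher_cost=10.0, fallback_rate=0.2)
break_even = break_even_fallback_rate(student_cost=1.0, teacher_cost=10.0)
```

If the student has to hand off most requests, the savings shrink quickly, which is one concrete form of the product needing the full capability of the larger model.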

Common Confusions

Distillation is not a perfect clone.

The student can fail differently from the teacher.

Distillation is not the same as quantization.

Quantization changes the numeric representation of a model's weights or activations, for example from 32-bit floats to 8-bit integers. Distillation trains a model from another model's behavior.
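
For contrast, here is a minimal sketch of symmetric 8-bit quantization applied to a handful of weights. No teacher and no training are involved: the same weights are stored at lower precision. The function names and the weight values are hypothetical, and real quantization schemes are more involved than this.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 0.9]   # made-up weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The two techniques are independent, so a distilled student can also be quantized afterwards.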

Distillation is not only for LLMs.

The idea applies broadly across machine learning, though it is very relevant to modern AI serving.

A distilled model still needs evaluation.

Teacher quality does not automatically guarantee student quality.

What This Does Not Mean

Distillation does not remove the need for data, objectives, or product design.

It gives another way to produce a deployable model, but the team still needs representative examples, careful evaluation, and a clear definition of acceptable behavior.

What to study next

These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.

Prerequisites

Read these first if the mechanics feel unfamiliar.

Quantization: Start here if Quantization is still fuzzy.
Supervised vs Unsupervised vs Self-Supervised Learning: Start here if Supervised vs Unsupervised vs Self-Supervised Learning is still fuzzy.