Knowledge Distillation
Knowledge distillation is the process of training a small "student" model to mimic the behaviour of a large "teacher" model. Rather than training the student directly on the original data, you run the teacher on that data and use its outputs as training targets for the student, including the nuanced probability distributions it assigns over possible answers, not just its final answers.

This works remarkably well because the teacher's outputs carry richer information than the raw training data. When a teacher model assigns 70% probability to the correct translation and 25% to a reasonable alternative, that distribution teaches the student something about the structure of the problem that a simple "correct answer" label wouldn't convey. The result is a small model that performs far better than the same model trained from scratch on the original data alone; it effectively inherits some of the larger model's sophistication.

Distillation is how many production AI applications work: the research lab trains an enormous model, then distills its capabilities into something small and fast enough for real-world deployment. For businesses, this is relevant because the model you're actually using is often a distilled version of a much larger one, which helps explain why different tiers of the same provider's service offer different quality-speed trade-offs.
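To make the idea concrete, here is a minimal sketch of one distillation training step in PyTorch. It is illustrative only, not a production recipe: the toy teacher and student models, the TEMPERATURE and ALPHA values, and the distillation_loss helper are all assumptions introduced for this example. The key point it shows is that the student is trained against the teacher's full probability distribution (the "soft targets") alongside the ordinary correct-answer labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 2.0   # softens the teacher's distribution so small probabilities still carry signal
ALPHA = 0.5         # balance between matching the teacher and fitting the true labels

def distillation_loss(student_logits, teacher_logits, labels):
    # Soft targets: the teacher's full probability distribution, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / TEMPERATURE, dim=-1)
    soft_student = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
    # KL divergence pulls the student's distribution toward the teacher's;
    # the T^2 factor keeps its gradient scale comparable to the hard-label loss.
    distill = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (TEMPERATURE ** 2)
    # Ordinary cross-entropy on the original "correct answer" labels.
    hard = F.cross_entropy(student_logits, labels)
    return ALPHA * distill + (1 - ALPHA) * hard

# Toy example: two tiny classifiers over a 3-class problem (purely illustrative).
teacher = nn.Linear(4, 3)   # stands in for a large, already-trained model
student = nn.Linear(4, 3)   # the small model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(8, 4)                # a batch of inputs
labels = torch.randint(0, 3, (8,))   # the original hard labels

with torch.no_grad():
    teacher_logits = teacher(x)      # the teacher is frozen; only its outputs are used

optimizer.zero_grad()
student_logits = student(x)
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
optimizer.step()
```

In a real pipeline the same step would run over the whole dataset for many epochs, with the teacher's outputs often precomputed once and cached, but the shape of the loss is the part that defines distillation.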