Quantisation

Quantisation reduces the numerical precision of a model's weights - the millions or billions of numbers that define what the model has learned. During training, these weights are typically stored as 32-bit or 16-bit floating-point numbers, which gives high precision but uses a lot of memory. Quantisation converts them to lower-precision formats: 8-bit, 4-bit or even lower. It's like rounding prices to the nearest pound instead of tracking pennies - you lose a tiny bit of precision but save enormous amounts of space. A model quantised from 16-bit to 4-bit weights uses roughly a quarter of the memory, which means it can run on much cheaper hardware or serve more users simultaneously.

Modern quantisation techniques are remarkably good at preserving quality: a well-quantised model is often nearly indistinguishable from the original in practical use. Some approaches build quantisation into the training process (quantisation-aware training) for better results, while others apply it to an already-trained model (post-training quantisation) for convenience.

For businesses, quantisation is one of the most important efficiency techniques because it directly affects deployment costs and hardware requirements. It's why you can run surprisingly capable AI models on a laptop or smartphone, and why API providers can offer competitive pricing despite the enormous underlying models.
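To make the memory arithmetic concrete, here is a minimal sketch of symmetric per-tensor quantisation in Python with NumPy. The function names and the choice of int8 are illustrative assumptions - production toolkits typically use per-channel scales, calibration data and packed 4-bit formats - but the core round-and-rescale idea is the same.

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    # Symmetric per-tensor quantisation: one scale factor maps the full
    # float range onto the int8 range [-127, 127]. Illustrative sketch only.
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original weights for use at inference time.
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 1024)).astype(np.float16)  # "trained" 16-bit weights

    q, scale = quantise_int8(w)
    w_hat = dequantise(q, scale)

    print(f"16-bit size: {w.nbytes / 1e6:.1f} MB")   # ~2.1 MB
    print(f" 8-bit size: {q.nbytes / 1e6:.1f} MB")   # ~1.0 MB, half the footprint
    print(f"max rounding error: {np.max(np.abs(w.astype(np.float32) - w_hat)):.4f}")
```

Running it shows the int8 copy at half the 16-bit footprint; packing the same weights into 4 bits would quarter it, which is the saving described above, at the cost of a slightly larger rounding error.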