Pruning

Pruning removes parts of a neural network that contribute little to its performance - the AI equivalent of cutting dead branches from a tree. Research has consistently shown that large neural networks contain significant redundancy: many weights sit so close to zero that removing them barely changes the output. The challenge is deciding which parts to remove.

There are two broad approaches, both sketched in code below. Unstructured pruning removes individual weights wherever they matter least; it can eliminate 90% or more of a model's parameters, but it produces irregular sparsity patterns that are difficult for hardware to accelerate efficiently. Structured pruning removes entire neurons, layers or attention heads, producing a genuinely smaller network that runs faster on standard hardware.

The process typically involves training the full model, ranking components by importance (often simply by weight magnitude), removing the least important ones, and then retraining briefly to recover any lost quality - frequently over several prune-and-retrain rounds.

For practical deployment, pruning and quantisation often work together: prune first to remove unnecessary complexity, then quantise the remaining weights for further compression. The combination can reduce a model's size and computational requirements dramatically. Pruning matters most when you need to deploy models on constrained hardware or when inference costs at scale are a significant concern - it trades a small amount of quality for meaningful operational savings.
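
To make the unstructured/structured distinction concrete, here is a minimal sketch using PyTorch's built-in torch.nn.utils.prune utilities. The toy model, layer choices and sparsity levels are illustrative assumptions, not recommendations.

```python
# Minimal sketch of both pruning styles via torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Unstructured: mask the 90% of individual weights with the smallest
# L1 magnitude. The tensor keeps its shape, so the irregular zeros
# only yield speedups on sparse-aware kernels.
prune.l1_unstructured(model[0], name="weight", amount=0.9)

# Structured: zero out half the rows (output neurons) of the hidden
# layer, ranked by L2 norm along dim=0 - a regular pattern that a
# genuinely smaller dense layer could replace.
prune.ln_structured(model[2], name="weight", amount=0.5, n=2, dim=0)

# PyTorch applies pruning through masks; remove() bakes the zeros
# into the weight tensors permanently.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")

print(f"layer 0 sparsity: {(model[0].weight == 0).float().mean().item():.1%}")
```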
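
The prune-and-retrain cycle described above is often run iteratively. The sketch below assumes a train_one_epoch callable defined elsewhere (a hypothetical stand-in for your fine-tuning loop) and an illustrative sparsity schedule.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_prune(model, train_one_epoch, schedule=(0.5, 0.75, 0.9)):
    """Alternate global magnitude pruning with brief retraining.

    `train_one_epoch` is assumed to fine-tune `model` in place; the
    schedule lists rising target sparsities over all Linear weights.
    """
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    for target in schedule:
        # Rank every Linear weight in the model by |w| and mask the
        # smallest. `amount` is a fraction of all weights, and weights
        # zeroed in earlier rounds rank lowest, so sparsity climbs
        # towards each target rather than compounding.
        prune.global_unstructured(
            params, pruning_method=prune.L1Unstructured, amount=target
        )
        # Brief retraining lets the surviving weights compensate.
        train_one_epoch(model)
    for module, name in params:
        prune.remove(module, name)  # make the masks permanent
    return model
```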
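
The prune-then-quantise combination might look like the following sketch, which masks small weights and then applies PyTorch's dynamic int8 quantisation to the Linear layers; the 80% sparsity figure is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_then_quantise(model: nn.Module, sparsity: float = 0.8) -> nn.Module:
    # Step 1: prune - mask the smallest-magnitude weights in each
    # Linear layer, then bake the masks in. (In practice a brief
    # fine-tune would sit between this step and quantisation.)
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")

    # Step 2: quantise - weights are stored as int8, activations are
    # quantised on the fly at inference time.
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

compressed = prune_then_quantise(
    nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
)
```

One caveat on this pipeline: the zeros survive quantisation but the dynamic int8 backend does not exploit them, so the size win here comes from int8 storage, while realising the sparsity win still requires sparse-aware kernels or structured surgery.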