Direct Preference Optimisation (DPO)
DPO emerged as a simpler alternative to RLHF that achieves comparable results with far less machinery. The key insight is that the separate reward model can be skipped entirely: instead of training a reward model on human preferences and then using reinforcement learning to optimise against it, DPO adjusts the language model's parameters directly from the preference data. You still need pairs of outputs labelled "better" and "worse," but the pipeline is streamlined: the reward-modelling and reinforcement-learning stages collapse into a single supervised-style training step, sketched below. This makes DPO faster to implement, cheaper to run and easier to debug when things go wrong.

The results are competitive with RLHF for many applications, though the two approaches have different strengths: RLHF can sometimes produce more nuanced behaviour, because an explicit reward model can capture preference patterns that DPO's more direct approach might miss.

For businesses considering model customisation, DPO is significant because it lowers the barrier to alignment work. Human preference data can be incorporated into a model more easily, enabling faster iteration. Several open-source models have been aligned using DPO, contributing to the growing accessibility of well-behaved AI models beyond the major providers.
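To make the "single training step" concrete, here is a minimal sketch of the DPO objective in PyTorch. It assumes you have already summed the token log-probabilities of each "better" (chosen) and "worse" (rejected) response under both the trainable policy model and a frozen reference model; the function and variable names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimisation loss for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities: the policy
    (trainable) and reference (frozen) models, each scored on the chosen
    and rejected responses. beta controls how far the policy may drift
    from the reference model.
    """
    # Implicit "rewards": how much the policy has moved away from the
    # reference model on each response, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: push the chosen response's implicit
    # reward above the rejected one. No reward model, no RL loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: the policy already slightly prefers the chosen response.
pol_c = torch.tensor([-12.0]); pol_r = torch.tensor([-15.0])
ref_c = torch.tensor([-13.0]); ref_r = torch.tensor([-14.0])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))  # small positive scalar
```

The loss is an ordinary supervised objective you can backpropagate through directly, which is why the whole preference-tuning stage reduces to one training loop.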