Reinforcement Learning from Human Feedback (RLHF)

RLHF is the technique that transformed large language models from impressive but erratic text predictors into the helpful assistants you use today. The process has three stages. First, the pretrained model is fine-tuned on high-quality examples of helpful responses (supervised fine-tuning, or SFT). Second, human evaluators compare pairs of model outputs and indicate which they prefer; these preferences are used to train a "reward model" that predicts which of two responses a human would prefer. Third, the language model is trained with reinforcement learning to maximise the reward model's score, in effect learning to produce the kinds of outputs that humans rated highly. (The core objectives of the second and third stages are sketched in code below.) This pipeline is what teaches models to follow instructions, give balanced answers, acknowledge uncertainty, and decline harmful requests.

RLHF is powerful but imperfect. The human evaluators' preferences become baked into the model, including any biases or blind spots they have. The process is expensive and slow because it requires extensive human evaluation. And there is an ongoing tension between helpfulness and safety: training a model to be maximally helpful can conflict with training it to refuse harmful requests. Despite these challenges, RLHF remains the primary method for aligning large language models with human intentions.
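To make the second stage concrete, here is a minimal sketch of reward-model training on preference pairs using the standard Bradley-Terry objective. The `RewardModel` class, `EMBED_DIM`, and the random embeddings are illustrative stand-ins; in practice the reward model is typically the language model itself with a scalar head, scoring full prompt-response pairs.

```python
import torch
import torch.nn as nn

# Illustrative setup: a small MLP scoring fixed-size response embeddings.
# In a real pipeline the reward model shares the LM's architecture and
# the inputs are tokenised prompt-response pairs, not random vectors.
EMBED_DIM = 768

class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar reward per response.
        return self.head(x).squeeze(-1)

model = RewardModel(EMBED_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Each training example is a (chosen, rejected) pair of response embeddings,
# where "chosen" is the response the human evaluator preferred.
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

r_chosen = model(chosen)
r_rejected = model(rejected)

# Bradley-Terry loss: maximise the probability that the chosen response
# outscores the rejected one, i.e. minimise -log sigmoid(r_chosen - r_rejected).
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

opt.zero_grad()
loss.backward()
opt.step()
```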
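For the third stage, one common formulation (popularised by InstructGPT-style training) combines the reward model's scalar score with a per-token KL penalty that keeps the policy close to the SFT reference model, so the policy cannot drift into degenerate text that games the reward. The sketch below uses hypothetical inputs: `logprobs_policy`, `logprobs_ref`, and `reward_score` stand in for quantities a real RL trainer would compute.

```python
import torch

def shaped_rewards(logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   reward_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one sampled response: a KL penalty of
    -beta * (log pi - log pi_ref) at every token, plus the reward
    model's sequence-level score added at the final token."""
    kl_penalty = -beta * (logprobs_policy - logprobs_ref)
    rewards = kl_penalty.clone()
    rewards[-1] += reward_score  # sequence-level score lands on the last token
    return rewards

# Example with a 5-token response (made-up log-probabilities):
lp_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.0])
lp_ref = torch.tensor([-1.1, -0.9, -2.0, -0.7, -1.1])
print(shaped_rewards(lp_policy, lp_ref, reward_score=0.9))
```

The `beta` coefficient governs the trade-off: too small and the policy over-optimises the reward model's blind spots, too large and it barely moves from the SFT model. These shaped rewards would then feed a policy-gradient optimiser such as PPO.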