Reinforcement Learning from Human Feedback (RLHF)

RLHF is the technique that transformed large language models from impressive but erratic text predictors into the helpful assistants you use today. The process has three stages. First, the pretrained model is fine-tuned on high-quality examples of helpful responses (supervised fine-tuning, or SFT). Second, human evaluators compare pairs of model outputs and indicate which they prefer; these preferences are used to train a "reward model" that predicts which of two responses a human would prefer. Third, the language model is trained with reinforcement learning to maximise the reward model's score, in effect learning to produce the kinds of outputs that humans rated highly. (The core objectives of the second and third stages are sketched in code below.) This pipeline is what teaches models to follow instructions, give balanced answers, acknowledge uncertainty, and decline harmful requests.

RLHF is powerful but imperfect. The human evaluators' preferences become baked into the model, including any biases or blind spots they have. The process is expensive and slow because it requires extensive human evaluation. And there is an ongoing tension between helpfulness and safety: training a model to be maximally helpful can conflict with training it to refuse harmful requests. Despite these challenges, RLHF remains the primary method for aligning large language models with human intentions.
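To make the second stage concrete, here is a minimal sketch of reward-model training on preference pairs using the standard Bradley-Terry objective. The `RewardModel` class, `EMBED_DIM`, and the random embeddings are illustrative stand-ins; in practice the reward model is typically the language model itself with a scalar head, scoring full prompt-response pairs.

```python
import torch
import torch.nn as nn

# Illustrative setup: a small MLP scoring fixed-size response embeddings.
# In a real pipeline the reward model shares the LM's architecture and
# the inputs are tokenised prompt-response pairs, not random vectors.
EMBED_DIM = 768

class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar reward per response.
        return self.head(x).squeeze(-1)

model = RewardModel(EMBED_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Each training example is a (chosen, rejected) pair of response embeddings,
# where "chosen" is the response the human evaluator preferred.
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

r_chosen = model(chosen)
r_rejected = model(rejected)

# Bradley-Terry loss: maximise the probability that the chosen response
# outscores the rejected one, i.e. minimise -log sigmoid(r_chosen - r_rejected).
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

opt.zero_grad()
loss.backward()
opt.step()
```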
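For the third stage, one common formulation (popularised by InstructGPT-style training) combines the reward model's scalar score with a per-token KL penalty that keeps the policy close to the SFT reference model, so the policy cannot drift into degenerate text that games the reward. The sketch below uses hypothetical inputs: `logprobs_policy`, `logprobs_ref`, and `reward_score` stand in for quantities a real RL trainer would compute.

```python
import torch

def shaped_rewards(logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   reward_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one sampled response: a KL penalty of
    -beta * (log pi - log pi_ref) at every token, plus the reward
    model's sequence-level score added at the final token."""
    kl_penalty = -beta * (logprobs_policy - logprobs_ref)
    rewards = kl_penalty.clone()
    rewards[-1] += reward_score  # sequence-level score lands on the last token
    return rewards

# Example with a 5-token response (made-up log-probabilities):
lp_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.0])
lp_ref = torch.tensor([-1.1, -0.9, -2.0, -0.7, -1.1])
print(shaped_rewards(lp_policy, lp_ref, reward_score=0.9))
```

The `beta` coefficient governs the trade-off: too small and the policy over-optimises the reward model's blind spots, too large and it barely moves from the SFT model. These shaped rewards would then feed a policy-gradient optimiser such as PPO.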