Reinforcement Learning from Human Feedback
RLHF
A technique for training AI models using human preferences as a reward signal, commonly used to align large language models with human values and instructions.
In Plain Language
Training AI by having humans rate its responses: the AI generates answers, humans say which ones are better, and the AI learns from that feedback, much like training a new employee through performance reviews. A minimal sketch of the core mechanism follows.
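The sketch below illustrates only the reward-modelling step of RLHF, where pairwise "which answer is better" labels become a scalar reward signal, using a Bradley-Terry style pairwise loss. It is a toy, assuming a hypothetical small scoring network and random tensors standing in for response embeddings; real systems score full text with a large language model and then optimise the policy against the learned reward.

```python
# A minimal sketch of RLHF's reward-modelling step (assumptions: toy
# scoring network, random tensors in place of real response embeddings).
import torch
import torch.nn as nn

# Hypothetical tiny reward model: maps a fixed-size response embedding
# to a scalar "how good is this response" score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake data standing in for embeddings of two responses per prompt,
# where human raters preferred the first response of each pair.
chosen = torch.randn(8, 16)    # embeddings of preferred responses
rejected = torch.randn(8, 16)  # embeddings of dispreferred responses

for step in range(100):
    # Score both responses in each pair.
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry pairwise loss: push the preferred response's score
    # above the rejected one's. This is how "which is better" labels
    # become a reward signal for the later reinforcement-learning stage.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, a reward model like this scores new candidate responses, and a reinforcement-learning algorithm (commonly PPO in published RLHF pipelines) updates the language model to produce higher-scoring answers.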
Why This Matters
RLHF is relevant to governance because it is the primary method used to align commercial AI systems with human values. Understanding how the AI models your organisation uses were trained helps you assess their risk profile and governance needs.
