Reward Model

RM

A model trained to predict human preferences, used in RLHF pipelines to guide the training of language models toward generating outputs that align with human values.
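Training typically uses pairwise human preference data: given a chosen and a rejected response, the reward model learns to score the chosen one higher (a Bradley-Terry style loss). The sketch below is a deliberately simplified illustration, not a real pipeline: it replaces the language-model backbone with a linear scorer over invented feature vectors, and all names (`reward`, `train`, the feature meanings) are hypothetical.

```python
import math

# Toy setup: each response is a small feature vector and the "reward
# model" is a linear scorer. Real reward models are fine-tuned language
# models; this only illustrates the pairwise preference loss.

def reward(weights, features):
    """Scalar reward: dot product of weights and response features."""
    return sum(w * f for w, f in zip(weights, features))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (chosen_features, rejected_features).
    Minimizes -log sigmoid(r_chosen - r_rejected) by gradient descent,
    so preferred responses end up with higher reward."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected)
            g = 1.0 - sigmoid(margin)
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

# Invented data: feature[0] ~ "helpfulness", feature[1] ~ "harmfulness".
# Labelers preferred the first response in each pair.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
]
w = train(pairs, dim=2)
print(reward(w, [0.9, 0.0]) > reward(w, [0.1, 1.0]))
```

Once trained, the scalar it outputs serves as the reward signal for the RL step of RLHF (e.g., PPO), steering the policy model toward responses humans would prefer.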

In Plain Language

An AI trained to judge what good responses look like, based on human preferences. It scores the main AI's outputs (for example, "this response is helpful" or "this response is harmful"), guiding the AI to improve.