AI & ML

RLHF (Reinforcement Learning from Human Feedback)

Also known as: RLHF

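A training technique that aligns LLM outputs with human preferences. Process: (1) train a reward model on human comparisons of model outputs, (2) use reinforcement learning (commonly PPO) to fine-tune the LLM to maximize the reward model's score, typically with a penalty that keeps the model close to its starting point. RLHF is used to make models more helpful, harmless, and honest. Used in training Claude, ChatGPT, and other assistants. Alternatives and related approaches include DPO (Direct Preference Optimization), which trains directly on preference pairs without a separate reward model, and Constitutional AI, which replaces part of the human feedback with AI feedback guided by written principles.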
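The two steps above can be illustrated with a minimal sketch of the quantities usually involved. It assumes a Bradley-Terry style pairwise loss for the reward model and a KL-style penalty during the RL step; both are common choices, not something this entry specifies. The function names, the scalar inputs, and the `beta` value are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Step 1 (assumed form): loss for one human comparison.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), which is small
    when the reward model scores the human-preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Step 2 (assumed form): the signal the RL algorithm maximizes.

    Subtracts beta * (log pi(y|x) - log pi_ref(y|x)) from the reward model
    score, discouraging the policy from drifting far from the reference model.
    """
    return reward - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # Reward model agrees with the human label: lower loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the human label: higher loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # Reward reduced because the policy assigns more probability than the reference.
    print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0))
```

In a real pipeline the rewards and log-probabilities would come from neural networks and be averaged over batches; the scalars here only show the shape of each objective.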

Related terms

AI & ML

LLM (Large Language Model)

A neural network trained on vast text corpora to understand and generate human language. LLMs (GPT-4, Claude, Llama, Gem...

AI & ML

Training (ML)

The process of optimizing a model's parameters by exposing it to data and adjusting weights to minimize a loss function....