Poster
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk · Ishita Mediratta · Christoforos Nalmpantis · Jelena Luketina · Eric Hambro · Edward Grefenstette · Roberta Raileanu
Halle B
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI’s ChatGPT or Anthropic’s Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model’s ability to generate varied outputs, and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.
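To make the notion of output diversity concrete, the sketch below computes distinct-n, a common lexical diversity measure (the fraction of unique n-grams among all n-grams in a set of sampled outputs). This is only a generic illustration under the assumption of simple whitespace tokenisation; it is not claimed to be one of the specific measures used in the paper, and the sample texts are hypothetical.

```python
# Minimal sketch of one common output-diversity measure (distinct-n).
# Illustrative only; not necessarily a measure used in this paper.
from typing import List


def distinct_n(outputs: List[str], n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in the given outputs."""
    ngrams = []
    for text in outputs:
        tokens = text.split()  # simple whitespace tokenisation for illustration
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical samples drawn from a fine-tuned model for a single prompt:
samples = [
    "The report covers quarterly revenue growth.",
    "The report covers quarterly revenue growth.",
    "Quarterly revenue grew, according to the report.",
]
print(f"distinct-2: {distinct_n(samples, n=2):.3f}")  # lower values indicate less varied outputs
```

In this framing, a model whose samples for the same input repeat the same phrasing (as an RLHF-tuned model might, per the paper's findings) yields a lower score than one producing more varied outputs.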