Poster
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain · Robert Kirk · Ekdeep Singh Lubana · Robert Dick · Hidenori Tanaka · Tim Rocktaeschel · Edward Grefenstette · David Krueger
Halle B
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including models that are safe to deploy. Despite its clear importance, there has been little work explaining how fine-tuning alters the underlying capabilities learned by a model during pre-training: does fine-tuning yield entirely novel capabilities, or does it merely inhibit existing ones? An answer to this question would improve our ability to trust fine-tuning protocols intended to improve the safety of pre-trained models and delete unsafe capabilities. We aim to make progress on this question by answering it in controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities change. We perform an exhaustive analysis of the effects of fine-tuning in these settings and show: (i) the ubiquitous protocol of fine-tuning with a small learning rate rarely alters the underlying model capabilities; (ii) often a minimal transformation, which we call a wrapper, is learned on top of the underlying model capability, yielding the impression that a new capability has been learned or a prior capability has been deleted; and (iii) continuing the fine-tuning process on a task where the pre-training capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model starts to accurately reuse that capability within just a few gradient steps. This potentially indicates that a practitioner could unintentionally render a safe model unsafe merely by fine-tuning it on a downstream task. We additionally analyze language models trained on the TinyStories dataset to support our claims in a realistic setting.
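To make the probing methodology mentioned above concrete, below is a minimal sketch (not the authors' code) of how a linear probe can be used to test whether a capability's internal representation survives fine-tuning. It assumes a HuggingFace-style transformer that accepts `output_hidden_states=True`; the function name, layer choice, and pooling strategy are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: fit a linear probe on one layer's activations and
# report held-out accuracy. High accuracy after fine-tuning suggests the
# capability's representation is still linearly decodable (i.e., wrapped
# or inhibited rather than deleted).
import torch
from sklearn.linear_model import LogisticRegression

def probe_capability(model, inputs, labels, layer_idx=-1, train_frac=0.8):
    """inputs: tokenized batch dict; labels: per-example capability labels."""
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        # Mean-pool over the token dimension to get one vector per example.
        acts = hidden[layer_idx].mean(dim=1).float().cpu().numpy()
    split = int(train_frac * len(labels))
    probe = LogisticRegression(max_iter=1000).fit(acts[:split], labels[:split])
    return probe.score(acts[split:], labels[split:])
```

Comparing probe accuracy on the same probing dataset before and after fine-tuning is one way to distinguish "capability deleted" from "capability merely wrapped": if the probe still recovers the capability from the fine-tuned model's activations, the underlying capability likely persists.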