Poster
On input-dependence and recall in convolutional language models
Simran Arora · Sabri Eyuboglu · Aman Timalsina · Isys Johnson · Michael Poli · James Y Zou · Atri Rudra · Christopher Re
Halle B
Convolution-based language models are asymptotically more efficient than Transformers, and recent work shows they are competitive in quality. To better understand the relative language modeling quality of these architectures, we pre-train a suite of 14 language models across attention and convolution-based architectures, finding that the SoTA gated convolution architectures still underperform Transformers by up to 2.1 perplexity points on the Pile. Our analysis shows that a single language modeling capability, termed associative recall (AR), accounts for 76% of the perplexity gap on average. AR requires outputting the next token using the prior context, e.g., Hakuna Matata means no worries Hakuna Matata it means no → ??. We show the issue arises because convolution-based models process sequences using fixed filters that do not depend on the input data, making it difficult to handle a variable number of input-specific recall distances (e.g., 4 tokens between instances of Hakuna vs. 5 between worries above). Theoretically, our core contributions are precise bounds for solving AR, applying to the entire class of gated convolution models, which show that the required model dimensionality scales with sequence length. Meanwhile, attention enables tokens separated by any distance to interact and solves AR with model dimension independent of sequence length. We present (1) a concise synthetic AR task, on which we validate that the theoretically predicted scaling holds, and (2) a series of architectural modifications, theoretically and empirically showing that they enable solving AR with improved scaling. Our analysis motivates a set of strong baseline models that outperform Transformers at 150M and 355M parameters. We release all checkpoints and code for future analysis.
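The recall-distance argument in the abstract can be made concrete with a toy sketch. The code below is not the authors' synthetic task or any of their models. It is a plain-Python illustration, and the helpers `recall_gaps`, `content_based_recall`, and `fixed_offset_recall` are hypothetical names introduced here. It counts the tokens between repeated occurrences in the Hakuna Matata example, then contrasts a lookup that depends on the input (find the earlier occurrence of the current token and return what followed it) with a lookup at one fixed offset, a loose stand-in for a filter that does not depend on the input.

```python
from collections import defaultdict


def recall_gaps(tokens):
    """Map each repeated token to the number of tokens between
    consecutive occurrences of it."""
    last_seen = {}
    gaps = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok in last_seen:
            gaps[tok].append(i - last_seen[tok] - 1)
        last_seen[tok] = i
    return dict(gaps)


def content_based_recall(context):
    """Input-dependent lookup: find the most recent earlier occurrence
    of the last context token and return the token that followed it."""
    query = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == query:
            return context[i + 1]
    return None


def fixed_offset_recall(context, offset):
    """Input-independent lookup (assumed toy model): copy the token that
    sits `offset` positions before the position being predicted."""
    pos = len(context) - offset
    return context[pos] if 0 <= pos < len(context) else None


context = "Hakuna Matata means no worries Hakuna Matata it means no".split()
gaps = recall_gaps(context + ["worries"])  # append the expected answer

print(gaps["Hakuna"], gaps["worries"])
print(content_based_recall(context), content_based_recall(context[:6]))
print(fixed_offset_recall(context, 6), fixed_offset_recall(context[:6], 6))
```

In this toy setting the content-based lookup answers both queries, while any single fixed offset answers only one of them: an offset of 6 recovers "worries" but not "Matata", and an offset of 5 does the reverse. This mirrors, in a very simplified form, why a variable number of input-specific recall distances is difficult for filters that are fixed in advance.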