ICLR Poster The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Poster

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Michael Zhang · Kush Bhatia · Hermann Kumbong · Christopher Re

Halle B

[ Abstract ]

[ OpenReview]

Abstract:

Linear attentions have shown promise for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) `inetuned-conversion of task-specific Transformers into linear versions that recover task performance, and (3) pretrained-conversion of Transformers, such as language models, into linear versions readily finetunable on downstream tasks. However, linear attentions often underperform compared to standard softmax attention. To close this performance gap, we study the behaviors of softmax and linear attentions in various train-from-scratch and finetuned-conversion settings. We find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or spiky) weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple, trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99\% of standard Transformer performance in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 when training causal GPT models from scratch, and up to 8.7 GLUE score points when converting finetuned bidirectional BERT models. Hedgehog also enables direct pretrained-conversion, achieving a new state-of-the-art WikiText-103 perplexity of 16.7 for 125M decoder-only Transformers by converting pretrained GPT-2 into a linear attention Transformer.

Chat is not available.