Poster
Harnessing Overlap in Blockwise Transformers for Near-Infinite Context
Hao Liu · Matei Zaharia · Pieter Abbeel
Halle B
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while overlapping the communication of key-value blocks between devices with the blockwise attention computation. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference on sequences that exceed 100 million tokens in length. Sequence length scales proportionally with the number of devices, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in reducing memory requirements and improving performance.
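The mechanism described in the abstract can be illustrated with a small single-process simulation. The sketch below is not the authors' implementation: it assumes non-causal attention, uses plain Python lists instead of an accelerator library, and models the ring as a list rotation, so it shows only the arithmetic (blockwise attention with a running softmax while key-value blocks travel around a ring) and not the overlap of communication with computation. The names `full_attention` and `ring_attention_sim` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [dot(qi, kj) * scale for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def ring_attention_sim(q, k, v, n_devices):
    """Simulate n_devices hosts that each own one block of the sequence.

    Assumption: the ring is modelled by rotating a Python list; a real
    system would send blocks between devices while computing.
    """
    seq_len, dim_v = len(q), len(v[0])
    assert seq_len % n_devices == 0, "sequence must split evenly"
    blk = seq_len // n_devices
    scale = 1.0 / math.sqrt(len(q[0]))

    # Each simulated device holds one query block and one key-value block.
    q_blocks = [q[i * blk:(i + 1) * blk] for i in range(n_devices)]
    kv_blocks = [(k[i * blk:(i + 1) * blk], v[i * blk:(i + 1) * blk])
                 for i in range(n_devices)]

    # Running softmax statistics per query: max score, denominator, numerator.
    run_max = [[-math.inf] * blk for _ in range(n_devices)]
    run_den = [[0.0] * blk for _ in range(n_devices)]
    run_num = [[[0.0] * dim_v for _ in range(blk)] for _ in range(n_devices)]

    for _ in range(n_devices):
        for dev in range(n_devices):
            k_blk, v_blk = kv_blocks[dev]
            for i, qi in enumerate(q_blocks[dev]):
                scores = [dot(qi, kj) * scale for kj in k_blk]
                new_max = max(run_max[dev][i], max(scores))
                # Rescale earlier partial sums to the new running maximum.
                corr = math.exp(run_max[dev][i] - new_max)
                w = [math.exp(s - new_max) for s in scores]
                run_den[dev][i] = run_den[dev][i] * corr + sum(w)
                run_num[dev][i] = [
                    run_num[dev][i][d] * corr
                    + sum(wj * vj[d] for wj, vj in zip(w, v_blk))
                    for d in range(dim_v)
                ]
                run_max[dev][i] = new_max
        # Ring step: device d receives the block previously held by d - 1.
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]

    return [[x / run_den[dev][i] for x in run_num[dev][i]]
            for dev in range(n_devices) for i in range(blk)]


random.seed(0)
seq_len, dim = 8, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

ref = full_attention(q, k, v)
out = ring_attention_sim(q, k, v, n_devices=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print("max abs difference vs. full attention:", max_err)
```

In this sketch each simulated device only ever holds one key-value block at a time, which is why the per-device memory depends on the block size and not on the full sequence length. The result matches full attention up to floating-point rounding because the running maximum and denominator make the blockwise softmax exact, not approximate.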