Poster
LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
Han Guo · Philip Greengard · Eric Xing · Yoon Kim
Halle B
We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component, which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Across standard NLP benchmarks, our low-rank plus quantized matrix decomposition approach (LQ-LoRA) performs well against strong baselines.
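As a rough illustration of the iterative decomposition described in the abstract, the sketch below alternates between quantizing the residual W − L and refitting the low-rank part L via a truncated SVD of W − Q. It is a minimal sketch under simplifying assumptions: `simulated_quantize` is a hypothetical round-to-nearest placeholder standing in for the actual blockwise quantizer, and the fixed `rank`, `bits`, and `iters` values are illustrative rather than taken from the paper.

```python
import numpy as np

def simulated_quantize(x, bits=4):
    # Hypothetical placeholder for the memory-efficient quantizer:
    # uniform round-to-nearest over the matrix's value range, used
    # here only to make the decomposition loop runnable.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def lq_decompose(W, rank=64, bits=4, iters=10):
    """Iteratively split W into a quantized component Q (kept fixed
    during finetuning) and a low-rank component L (trainable)."""
    L = np.zeros_like(W)
    for _ in range(iters):
        # Quantize the part of W not captured by the low-rank term.
        Q = simulated_quantize(W - L, bits=bits)
        # Refit the low-rank term to the quantization residual.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return Q, L
```

In this reading, finetuning would update only the factors of L while Q stays frozen; the paper's ILP-based bit allocation and Fisher-weighted reconstruction objective are not reflected in this toy sketch.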