LQER: Low-rank quantization error reconstruction for LLMs

Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of the quantization error towards a desirable distribution, which enables nearly-lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36× fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer
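The abstract describes approximating a layer's quantization error with a low-rank term whose SVD is taken on an activation-scaled error matrix. The sketch below illustrates that idea only: the function names, the toy 4-bit quantizer, and the side on which the activation-induced diagonal scale is applied are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the idea in the abstract: reconstruct the quantization
# error W - W_q with a low-rank factor pair obtained from an SVD of the
# activation-scaled error. All names and the quantizer below are assumed.
import torch


def quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Toy symmetric per-row 4-bit quantizer (placeholder for the real W4 scheme)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale


def lqer_decompose(w: torch.Tensor, act_scale: torch.Tensor, rank: int):
    """Return (w_q, a_k, b_k) such that w ≈ w_q + a_k @ b_k.

    w:         (out_features, in_features) full-precision weight
    act_scale: (in_features,) positive statistics of the layer's input activations
    rank:      number of singular components kept for the error reconstruction
    """
    w_q = quantize_4bit(w)
    err = w - w_q  # quantization error E

    # Activation-induced diagonal scaling concentrates the error energy in the
    # leading singular values before truncation (assumed placement: columns).
    s = act_scale.clamp_min(1e-6)
    u, sigma, vh = torch.linalg.svd(err * s, full_matrices=False)

    a_k = u[:, :rank] * sigma[:rank]   # (out_features, rank)
    b_k = vh[:rank, :] / s             # (rank, in_features), undo the scaling
    return w_q, a_k, b_k
```

At inference, y = x @ (w_q + a_k @ b_k).T can be evaluated as two dense GEMMs, x @ w_q.T plus (x @ b_k.T) @ a_k.T, which is consistent with the abstract's point that no Scatter/Gather of high-precision weights from irregular memory locations is required.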
Original language: English
Title of host publication: Proceedings of the 41st International Conference on Machine Learning
Editors: Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, Felix Berkenkamp
Publisher: PMLR
Volume: 235
Publication status: Published - 27 Jul 2024
Event: The 41st International Conference on Machine Learning - Vienna, Austria
Duration: 21 Jul 2024 - 27 Jul 2024
https://icml.cc/

Conference

Conference: The 41st International Conference on Machine Learning
Abbreviated title: ICML 2024
Country/Territory: Austria
City: Vienna
Period: 21/07/24 - 27/07/24
Internet address: https://icml.cc/
