Revisiting block-based quantisation: What is important for sub-8-bit LLM inference?

Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve 19× higher arithmetic density and 5× higher memory density than the float32 baseline, surpassing the prior-art 8-bit quantisation by 2.5× in arithmetic density and 1.2× in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
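To make the core idea concrete, the sketch below shows block quantisation in the generic sense used in the abstract: numbers are packed into fixed-size blocks that share one scaling factor. The block size, bit-width, and max-abs scaling rule are illustrative assumptions for this sketch and are not taken from the authors' released implementation.

```python
# Minimal NumPy sketch of block quantisation (illustrative only).
# Each block of values shares a single scaling factor, so the integer
# codes stay small while large per-block magnitudes are still representable.
import numpy as np

def block_quantise(x: np.ndarray, block_size: int = 16, bits: int = 6):
    """Quantise a 1-D float tensor to signed integers with one scale per block."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 31 for 6-bit signed codes
    pad = (-x.size) % block_size                  # pad so blocks divide evenly
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                     # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, pad

def block_dequantise(q: np.ndarray, scales: np.ndarray, pad: int) -> np.ndarray:
    """Reconstruct the float tensor from integer codes and per-block scales."""
    x_hat = (q.astype(np.float32) * scales).reshape(-1)
    return x_hat[: x_hat.size - pad] if pad else x_hat

# Example: quantise a weight row and inspect the reconstruction error.
w = np.random.randn(100).astype(np.float32)
q, s, pad = block_quantise(w, block_size=16, bits=6)
w_hat = block_dequantise(q, s, pad)
print("max abs error:", np.abs(w - w_hat).max())
```

Because only one scale is stored per block, the per-value storage cost stays close to the integer bit-width, which is where the memory- and arithmetic-density gains reported in the abstract come from.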
Original language: English
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics
Pages: 9988-10006
Number of pages: 19
ISBN (Electronic): 9798891760608
Publication status: Published - 10 Dec 2023
Event: The 2023 Conference on Empirical Methods in Natural Language Processing - Resorts World Convention Centre, Sentosa, Singapore
Duration: 6 Dec 2023 - 10 Dec 2023
Conference number: 28
https://2023.emnlp.org/

Conference

Conference: The 2023 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2023
Country/Territory: Singapore
City: Sentosa
Period: 6/12/23 - 10/12/23