BayesTune: Bayesian Sparse Deep Model Fine-tuning

Minyoung Kim, Timothy M Hospedales

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Deep learning practice is increasingly driven by powerful foundation models (FM), pre-trained at scale and then fine-tuned for specific tasks of interest. A key property of this workflow is the efficacy of sparse or parameter-efficient fine-tuning: updating only a tiny fraction of the FM's parameters on a downstream task can lead to surprisingly good performance, often even superior to a full model update. However, it is not clear how to select which parameters to update in an optimal and principled way. Although a growing number of sparse fine-tuning methods have been proposed, most are unsatisfactory, relying on hand-crafted heuristics or heavy approximations. In this paper we propose a novel Bayesian sparse fine-tuning algorithm: we place a (sparse) Laplace prior on each parameter of the FM, with the mean equal to the parameter's initial value and the scale parameter governed by a hyper-prior that encourages small scales. Roughly speaking, the posterior mean of a scale parameter indicates how important it is to update the corresponding parameter away from its initial value when solving the downstream task. Given the sparse prior, most scale parameters are small a posteriori, and the few large-valued scale parameters identify those FM parameters that crucially need to be updated away from their initial values. Based on this, we can threshold the scale parameters to decide which parameters to update or freeze, leading to a principled sparse fine-tuning strategy. To efficiently infer the posterior distribution of the scale parameters, we adopt the Langevin MCMC sampler, which requires only twice the cost of vanilla SGD. Tested on popular NLP benchmarks as well as the VTAB vision tasks, our approach shows significant improvement over the state of the art (e.g., about one percentage point higher than the best prior method when fine-tuning RoBERTa on the GLUE and SuperGLUE benchmarks).
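To make the recipe in the abstract concrete, the following is a minimal toy sketch of the described idea, not the authors' implementation: joint Langevin (SGLD) sampling over model parameters and per-parameter Laplace scales, followed by thresholding the posterior-mean scales to pick which parameters to fine-tune. The Gamma hyper-prior, the exp-parameterisation of the scales, the logistic-regression likelihood, and the mean-based threshold are all illustrative assumptions not specified in the abstract.

```python
# Minimal sketch (assumptions noted above): SGLD over theta and per-parameter
# Laplace scales lambda = exp(rho), then threshold posterior-mean lambda.
import torch

torch.manual_seed(0)
N, D = 512, 20                       # toy dataset size / parameter count
X = torch.randn(N, D)
y = (X[:, 0] - X[:, 1] > 0).float()  # only two features actually matter
theta0 = torch.zeros(D)              # stands in for pre-trained initial values

theta = theta0.clone()
rho = torch.full((D,), -3.0)         # lambda = exp(rho), start small
a, b = 0.5, 10.0                     # assumed Gamma hyper-prior (shape < 1 favours small scales)
eps, steps = 1e-3, 5000
lam_sum, n_samp = torch.zeros(D), 0

def log_joint(theta, rho):
    lam = rho.exp()
    logits = X @ theta
    # downstream-task log-likelihood (toy logistic regression)
    loglik = -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, y, reduction="sum")
    # Laplace prior centred at theta0 with per-parameter scale lam
    logprior = (-torch.log(2 * lam) - (theta - theta0).abs() / lam).sum()
    # Gamma hyper-prior on lam, plus log|d lam / d rho| = rho (change of variables)
    loghyper = ((a - 1) * torch.log(lam) - b * lam + rho).sum()
    return loglik + logprior + loghyper

for t in range(steps):
    theta.requires_grad_(True)
    rho.requires_grad_(True)
    g_theta, g_rho = torch.autograd.grad(log_joint(theta, rho), (theta, rho))
    with torch.no_grad():
        # SGLD step: half-step gradient ascent on log joint + Gaussian noise
        theta = theta + 0.5 * eps * g_theta + eps ** 0.5 * torch.randn(D)
        rho = rho + 0.5 * eps * g_rho + eps ** 0.5 * torch.randn(D)
    if t >= steps // 2:               # crude burn-in, then average the scales
        lam_sum += rho.exp()
        n_samp += 1

lam_mean = lam_sum / n_samp
mask = lam_mean > lam_mean.mean()     # hypothetical threshold: params to update
print("selected parameter indices:", mask.nonzero().flatten().tolist())
```

In this toy run the large posterior-mean scales should concentrate on the two informative coordinates, illustrating how thresholding the scales yields a sparse set of parameters to fine-tune while the rest stay frozen at their initial values.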
Original language: English
Title of host publication: Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Publisher: Curran Associates Inc
Number of pages: 49
Publication status: Published - 15 Dec 2023
Event: Thirty-seventh Conference on Neural Information Processing Systems - New Orleans, United States
Duration: 10 Dec 2023 - 16 Dec 2023
Conference number: 37


Conference: Thirty-seventh Conference on Neural Information Processing Systems
Abbreviated title: NeurIPS 2023
Country/Territory: United States
City: New Orleans


