Optimising calls to large language models with uncertainty-based two-tier selection

Guillem Ramirez, Alexandra Birch, Ivan Titov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Researchers and practitioners operating on a limited budget face the well-known cost-performance trade-off dilemma. The challenging decision often centres on whether to use a large LLM with better performance or a smaller one with reduced costs. This has motivated recent research into optimising LLM calls. Existing work follows either a cascading strategy, where the smaller LLM is called first and the larger one only when needed, or a routing strategy, where only one model is ever called. Both scenarios depend on a decision criterion, which is typically an auxiliary neural model. In this work, we propose a cost-effective solution: we use only the uncertainty of the small LLM's generations as the decision criterion. We compare our approach with both cascading and routing strategies using three different pairs of pre-trained small and large LLMs, on nine different tasks, and against approaches that require an additional neural model. Our experiments reveal that this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
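The idea sketched in the abstract can be illustrated with a small amount of code. The snippet below is a hypothetical sketch, not the paper's exact method: it assumes the uncertainty measure is the average negative log-probability of the small model's generated tokens, and the `small_generate` / `large_generate` callables are stand-ins for real LLM API calls. A single threshold decides whether the cheap answer is accepted or the query is escalated to the larger model.

```python
from typing import Callable, List, Tuple


def sequence_uncertainty(token_logprobs: List[float]) -> float:
    """Average negative log-probability of the generated tokens.

    Higher values mean the small model was less confident.
    (One plausible uncertainty measure; the paper may use another.)
    """
    return -sum(token_logprobs) / max(len(token_logprobs), 1)


def two_tier_answer(
    prompt: str,
    small_generate: Callable[[str], Tuple[str, List[float]]],
    large_generate: Callable[[str], str],
    threshold: float = 1.0,
) -> str:
    """Call the small model first; escalate to the large model only
    when the small model's generation is too uncertain."""
    answer, logprobs = small_generate(prompt)
    if sequence_uncertainty(logprobs) > threshold:
        # Cascade: pay the extra cost of the large model.
        return large_generate(prompt)
    # Accept the cheap answer.
    return answer


# Toy usage with stub "models":
confident_small = lambda p: ("Paris", [-0.05, -0.02])
large = lambda p: "Paris (from large model)"
print(two_tier_answer("Capital of France?", confident_small, large))
```

The threshold controls the cost-performance trade-off: a lower threshold escalates more queries (higher cost, closer to large-model quality), while a higher threshold accepts more small-model answers.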
Original language: English
Title of host publication: Proceedings of the 2024 Conference on Language Modeling
Publication status: Accepted/In press - 10 Jul 2024
Event: Conference on Language Modeling - University of Pennsylvania, Philadelphia, United States
Duration: 7 Oct 2024 - 9 Oct 2024
https://colmweb.org/

Conference

Conference: Conference on Language Modeling
Abbreviated title: COLM 2024
Country/Territory: United States
City: Philadelphia
Period: 7/10/24 - 9/10/24
Internet address: https://colmweb.org/
