ServerlessLLM: Low-latency serverless inference for large language models

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM makes three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system that fully utilizes the bandwidth of the complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality status of checkpoints on each server and schedules the model onto the server that minimizes the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10-200X across various LLM inference workloads.
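To illustrate the idea behind contribution (iii), the following is a minimal sketch of locality-aware, startup-time-optimized scheduling: estimate each server's time-to-first-inference from the fastest storage tier that already holds the checkpoint plus any wait for a free GPU, then pick the minimum. All class names, bandwidth figures, and the simple cost model are hypothetical illustrations, not the paper's actual implementation.

    from dataclasses import dataclass

    # Per-server view of where a model's checkpoint lives and how fast each
    # tier can stream it into GPU memory. Names and numbers are illustrative.
    @dataclass
    class ServerState:
        name: str
        checkpoint_tier: str       # "dram", "ssd", or "remote"
        tier_bandwidth_gbps: dict  # achievable GB/s per tier on this server
        queueing_delay_s: float    # time until a GPU frees up (e.g. via migration)

    def estimated_startup_time(server: ServerState, model_size_gb: float) -> float:
        """Startup time ~= time to load the checkpoint into GPU memory from the
        fastest tier that already holds it, plus the wait for a free GPU."""
        bandwidth = server.tier_bandwidth_gbps[server.checkpoint_tier]
        return model_size_gb / bandwidth + server.queueing_delay_s

    def schedule(servers: list[ServerState], model_size_gb: float) -> ServerState:
        """Pick the server that minimizes estimated time-to-first-inference."""
        return min(servers, key=lambda s: estimated_startup_time(s, model_size_gb))

    if __name__ == "__main__":
        servers = [
            ServerState("gpu-0", "remote", {"remote": 1.0}, 0.0),  # must download
            ServerState("gpu-1", "ssd",    {"ssd": 6.0},    0.0),  # local NVMe copy
            ServerState("gpu-2", "dram",   {"dram": 25.0},  3.0),  # cached in host memory, but busy
        ]
        best = schedule(servers, model_size_gb=26.0)  # e.g. a 13B fp16 model
        print(f"Schedule onto {best.name}")

In this toy example the DRAM-cached server wins despite a queueing delay, mirroring the paper's observation that checkpoint locality, not raw availability, often dominates startup latency.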
Original language: English
Title of host publication: Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation
Publisher: USENIX Association
Pages: 135-153
Number of pages: 19
ISBN (Electronic): 9781939133403
ISBN (Print): 9781713899860
Publication status: Published - 12 Jul 2024
Event: 18th USENIX Symposium on Operating Systems Design and Implementation - Santa Clara, United States
Duration: 10 Jul 2024 - 12 Jul 2024

Conference

Conference: 18th USENIX Symposium on Operating Systems Design and Implementation
Abbreviated title: OSDI 2024
Country/Territory: United States
City: Santa Clara
Period: 10/07/24 - 12/07/24
