RMAI: Rethinking memory for AI (inference)

Amir Noohi, Mostafa Derispour, Antonio Barbalace

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As AI models grow exponentially in size, memory has emerged as a critical bottleneck for inference at scale. While hardware solutions such as Compute Express Link (CXL) promise to solve the problems of memory capacity and sharing, they require capital investment and are not widely available. This paper presents RMAI, an in-kernel remote shared memory framework tailored for AI inference workloads. It offers a transparent, scalable, and cost-effective software alternative to hardware-based memory expansion and sharing solutions. By leveraging the operating system's capabilities, RMAI introduces dynamic virtual memory regions that reduce page faults, minimize the overheads of user-kernel transitions, and optimize data locality for inference workloads. We focus in particular on Mixture-of-Experts (MoE) models. In an initial evaluation, we demonstrate that RMAI achieves performance comparable to CXL-like architectures. Compared to disk-based solutions, it delivers up to 10× faster expert switching and lower memory management overhead across large-scale inference tasks. This work redefines the role of remote shared memory in AI systems, positioning it as a practical, high-performance solution for memory capacity and sharing in modern data centers.
Original language: English
Title of host publication: The 5th Workshop on Machine Learning and Systems (EuroMLSys ’25)
Place of publication: New York, NY, USA
Publisher: Association for Computing Machinery (ACM)
Pages: 1-10
Number of pages: 10
Publication status: Accepted/In press, 25 Feb 2025
Event: The 5th Workshop on Machine Learning and Systems, Rotterdam, Netherlands
Duration: 31 Mar 2025 → 31 Mar 2025
Conference number: 5
https://euromlsys.eu/#

Workshop

Workshop: The 5th Workshop on Machine Learning and Systems
Abbreviated title: EuroMLSys 2025
Country/Territory: Netherlands
City: Rotterdam
Period: 31/03/25 → 31/03/25

Keywords

  • memory disaggregation
  • distributed shared memory
  • mixture of experts
  • kernel-level memory management
  • RDMA
  • Compute Express Link (CXL)
