Transparent Fault-Tolerance Using Intra-Machine Full-Software-Stack Replication on Commodity Multicore Hardware

Giuliano Losa, Antonio Barbalace, Yuzhong Wen, Marina Sadini, Ho-Ren Chuang, Binoy Ravindran

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As the number of processors and the size of the memory of computing systems keep increasing, the likelihood of CPU core failures, memory errors, and bus failures increases and can threaten system availability. Software components can be hardened against such failures by running several replicas of a component on hardware replicas that fail independently and that are coordinated by a State-Machine Replication protocol. One common solution is to replicate the physical machine to provide redundancy, and to rewrite the software to address coordination. However, a CPU core failure, a memory error, or a bus error is unlikely to always crash an entire machine. Thus, full machine replication may sometimes be overkill, increasing resource costs.
In this paper, we introduce full software stack replication within a single commodity machine. Our approach runs replicas on fault-independent hardware partitions (e.g., NUMA nodes), wherein each partition is software-isolated from the others and has its own CPU cores, memory, and full software stack. A hardware failure in one partition can be recovered by another partition taking over its functionality. We have realized this vision by implementing FT-Linux, a Linux-based operating system that transparently replicates race-free, multi-threaded POSIX applications on different hardware partitions of a single machine. Our evaluations of FT-Linux on several popular Linux applications show a worst-case slowdown (due to replication) of ≈20%.
Original language: English
Title of host publication: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1521-1531
Number of pages: 11
ISBN (Electronic): 978-1-5386-1792-2
ISBN (Print): 978-1-5386-1793-9
Publication status: Published - 17 Jul 2017
Event: The 37th IEEE International Conference on Distributed Computing Systems - Atlanta, United States
Duration: 5 Jun 2017 – 8 Jun 2017
http://icdcs2017.gatech.edu/

Publication series

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
ISSN (Print): 1063-6927

Conference

Conference: The 37th IEEE International Conference on Distributed Computing Systems
Abbreviated title: ICDCS 2017
Country/Territory: United States
City: Atlanta
Period: 5/06/17 – 8/06/17

Keywords / Materials (for Non-textual outputs)

  • fault tolerant computing
  • Linux
  • multiprocessing systems
  • multi-threading
  • software engineering
  • transparent fault-tolerance
  • intra-machine full-software-stack replication
  • commodity multicore hardware
  • fault-independent hardware partitions
  • hardware failure
  • FT-Linux
  • Linux-based operating system
  • race-free multithreaded POSIX applications
  • Hardware
  • Fault tolerance
  • Fault tolerant systems
  • Kernel
  • Servers
  • Hardware failure
  • software failure
  • multicore
  • replicated-kernel OS
  • state-machine replication
  • POSIX
