Abstract
As the number of processors and the size of memory in computing systems keep increasing, the likelihood of CPU core failures, memory errors, and bus failures increases and can threaten system availability. Software components can be hardened against such failures by running several replicas of a component on hardware replicas that fail independently and that are coordinated by a State-Machine Replication protocol. One common solution is to replicate the physical machine to provide redundancy, and to rewrite the software to address coordination. However, a CPU core failure, a memory error, or a bus error is unlikely to always crash an entire machine. Thus, full machine replication may sometimes be overkill, increasing resource costs.
In this paper, we introduce full software stack replication within a single commodity machine. Our approach runs replicas on fault-independent hardware partitions (e.g., NUMA nodes), where each partition is software-isolated from the others and has its own CPU cores, memory, and full software stack. A hardware failure in one partition can be recovered by another partition taking over its functionality. We have realized this vision by implementing FT-Linux, a Linux-based operating system that transparently replicates race-free, multi-threaded POSIX applications on different hardware partitions of a single machine. Our evaluations of FT-Linux on several popular Linux applications show a worst-case slowdown (due to replication) of ≈20%.
Original language | English |
---|---|
Title of host publication | 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 1521-1531 |
Number of pages | 11 |
ISBN (Electronic) | 978-1-5386-1792-2 |
ISBN (Print) | 978-1-5386-1793-9 |
Publication status | Published - 17 Jul 2017 |
Event | The 37th IEEE International Conference on Distributed Computing Systems - Atlanta, United States. Duration: 5 Jun 2017 → 8 Jun 2017. http://icdcs2017.gatech.edu/ |
Publication series
Name | |
---|---|
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
ISSN (Print) | 1063-6927 |
Conference
Conference | The 37th IEEE International Conference on Distributed Computing Systems |
---|---|
Abbreviated title | ICDCS 2017 |
Country/Territory | United States |
City | Atlanta |
Period | 5/06/17 → 8/06/17 |
Internet address | http://icdcs2017.gatech.edu/ |
Keywords / Materials (for Non-textual outputs)
- fault tolerant computing
- Linux
- multiprocessing systems
- multi-threading
- software engineering
- transparent fault-tolerance
- intra-machine full-software-stack replication
- commodity multicore hardware
- fault-independent hardware partitions
- hardware failure
- FT-Linux
- Linux-based operating system
- race-free multithreaded POSIX applications
- Hardware
- Fault tolerance
- Fault tolerant systems
- Kernel
- Servers
- Hardware failure
- software failure
- multicore
- replicated-kernel OS
- state-machine replication
- POSIX