ParaMedic: Heterogeneous Parallel Error Correction

S. Ainsworth, Timothy M Jones

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Processor error detection can be reduced in cost significantly by exploiting the parallelism that exists in a repeated copy of an execution, which may not exist in the original code, to split up the redundant work on a large number of small, highly efficient cores. However, such schemes don't provide a method for automatic error recovery. We develop ParaMedic, an architecture to allow efficient automatic correction of errors detected in a system by using parallel heterogeneous cores, to provide a full fail-safe system that does not propagate errors to other systems, and can recover without manual intervention. This uses logging to roll back any computation that occurred after a detected error, along with a set of techniques to provide error-checking parallelism while still preventing the escape of incorrect processor values in multicore environments, where ordering of individual processors' logs is not enough to be able to roll back execution. Across a set of single and multi-threaded benchmarks, we achieve 3.1% and 1.5% overhead respectively, compared with 1.9% and 1% for error detection alone.
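
The log-based rollback mentioned in the abstract can be illustrated with a small software analogy. This is a minimal sketch only, not the paper's hardware design: the `UndoLog` class, its method names, and the dictionary standing in for memory are all hypothetical. It assumes each store records the value it overwrites, so that a segment of execution later found faulty can be undone in reverse order, while a segment that passes checking is committed and its log entries discarded.

```python
_ABSENT = object()  # sentinel: the address held no value before the store


class UndoLog:
    """Hypothetical undo log: records overwritten values so stores can be reverted."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for memory: address -> value
        self.entries = []      # (address, previous value) for unchecked stores

    def store(self, addr, value):
        # Log the old value before overwriting it.
        self.entries.append((addr, self.memory.get(addr, _ABSENT)))
        self.memory[addr] = value

    def commit(self):
        # The segment was checked and found correct: its log is no longer needed.
        self.entries.clear()

    def rollback(self):
        # An error was detected: undo unchecked stores, newest first.
        while self.entries:
            addr, old = self.entries.pop()
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


memory = {"x": 1}
log = UndoLog(memory)

log.store("x", 2)
log.commit()          # first segment passes checking

log.store("x", 99)    # second segment is later found to be faulty
log.store("y", 7)
log.rollback()        # memory returns to the last committed state
```

After the rollback, `memory` holds only the committed value of `x`. Undoing entries newest-first matters when one address is written several times within a segment, since the oldest logged value is the one that must survive.
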
Original language: English
Title of host publication: 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Publisher: IEEE Xplore
Pages: 201-213
Number of pages: 13
ISBN (Electronic): 978-1-7281-0057-9
ISBN (Print): 978-1-7281-0058-6
Publication status: Published - 22 Aug 2019
Event: 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Portland, United States
Duration: 24 Jun 2019 to 27 Jun 2019
http://2019.dsn.org/

Publication series

ISSN (Print): 1530-0889

Conference

Conference: 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Abbreviated title: DSN 2019
Country: United States
City: Portland
Period: 24/06/19 to 27/06/19

Keywords

  • error detection
  • multi-threading
  • parallel processing
  • system recovery
  • ParaMedic
  • heterogeneous parallel error correction
  • processor error detection
  • automatic error recovery
  • parallel heterogeneous cores
  • fail-safe system
  • error-checking parallelism
  • multi-threaded benchmarks
  • automatic correction
  • Multicore processing
  • Hardware
  • Error correction codes
  • Parallel processing
  • Out of order
  • Error correction
  • fault tolerance
  • microarchitecture
