ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance

Sam Ainsworth, Lionel Zoubritzky, Alan Mycroft, Timothy M. Jones

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Providing reliability is becoming a challenge for chip manufacturers, faced with simultaneously trying to improve miniaturization, performance and energy efficiency. This leads to very large margins on voltage and frequency, designed to avoid errors even in the worst case, along with significant hardware expenditure on eliminating voltage spikes and other forms of transient error, causing considerable inefficiency in power consumption and performance.

We flip traditional ideas about reliability and performance around, by exploring the use of error resilience for power and performance gains. ParaMedic is a recent architecture that provides a solution for reliability with low overheads via automatic hardware error recovery, by splitting up checking on to many small cores in a heterogeneous multicore system with hardware logging support, but its design is based on the idea that errors are exceptional. We transform ParaMedic into ParaDox, which shows high performance in both error-intensive and scarce-error scenarios, thus allowing correct execution even when undervolted and overclocked.

Evaluation within error-intensive simulation environments confirms the error resilience of ParaDox and the low associated recovery cost. We estimate that compared to a non-resilient system with margins, ParaDox can reduce energy delay product by 15% through undervolting, while completely recovering from any induced errors.
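
For readers unfamiliar with the metric, energy-delay product (EDP) is conventionally the energy consumed multiplied by the execution time, so undervolting helps only if the energy saved outweighs any slowdown from error recovery. The sketch below illustrates that trade-off; the function names and all input numbers are hypothetical, chosen for illustration, and are not measurements from the paper.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Conventional EDP: energy consumed multiplied by execution time."""
    return energy_joules * delay_seconds


def relative_edp_change(base_energy: float, base_delay: float,
                        new_energy: float, new_delay: float) -> float:
    """Fractional EDP change of a new configuration versus a baseline.

    A negative result means the new configuration has a lower (better) EDP.
    """
    base = energy_delay_product(base_energy, base_delay)
    new = energy_delay_product(new_energy, new_delay)
    return (new - base) / base


if __name__ == "__main__":
    # Hypothetical, normalised inputs (NOT figures from the paper):
    # undervolting saves 20% energy but recovery adds 5% to run time.
    change = relative_edp_change(1.0, 1.0, 0.80, 1.05)
    print(f"Relative EDP change: {change:+.1%}")
```

With these made-up inputs the energy saving dominates the recovery slowdown, so EDP falls; a higher error rate would raise the delay term and shrink that gain.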
Original language: English
Title of host publication: 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 13
Publication status: Accepted/In press - 28 Oct 2020
Event: The 27th IEEE International Symposium on High-Performance Computer Architecture - Seoul, Korea, Republic of
Duration: 27 Feb 2021 → 3 Mar 2021
Conference number: 27
https://hpca-conf.org/2021/

Conference

Conference: The 27th IEEE International Symposium on High-Performance Computer Architecture
Abbreviated title: HPCA 2021
Country: Korea, Republic of
City: Seoul
Period: 27/02/21 → 3/03/21

Keywords

  • fault tolerance
  • microarchitecture
  • error detection
  • voltage margins
