Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

Q. Yu, Jose Cano Reyes, J. Flich, P. Ampadu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

High-end MPSoC systems with built-in high-radix topologies achieve good performance because of their improved connectivity and reduced network diameter. In such systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than in routers using a simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to identify the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all permanent failure combinations of long-range links and 80% of link failures on short-range links. Case studies show that, compared to triple modular redundancy, the error detection method based on the new IIR extraction method reduces power consumption by 33% and the residual error rate by up to two orders of magnitude. The impact of network topologies on the efficiency of the detection mechanism is also examined.
Original language: English
Title of host publication: Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on
Pages: 169-176
Number of pages: 8
Publication status: Published - 1 May 2012

Keywords

  • circuit complexity
  • digital arithmetic
  • error detection
  • error statistics
  • fault tolerance
  • logic design
  • multiprocessing systems
  • network routing
  • network topology
  • network-on-chip
  • redundancy
  • IIR extraction method
  • LBDRhr logic
  • LBDRhr mechanism
  • XY routing algorithm
  • built-in high-radix topology
  • combined method
  • compulsory feature
  • connectivity
  • detection mechanism
  • error detection method
  • fault tolerance support
  • forbidden signal patterns
  • high-end MPSoC systems
  • high-end multiprocessor systems-on-chip
  • inherent information redundancy
  • internal nodes
  • long-range links
  • network diameter
  • network routers
  • permanent error control
  • permanent failure combinations
  • permanent link failures
  • popular high-radix topology
  • power consumption
  • residual error rate
  • router complexity
  • router failures
  • short-range links
  • thorough analyses
  • transient error control
  • transient error detection
  • transient errors
  • transient link
  • transient router errors
  • triple modular redundancy
  • Logic gates
  • Network topology
  • Redundancy
  • Routing
  • Topology
  • Transient analysis
  • Networks-on-chip
  • arbiter
  • fault tolerant
  • information redundancy
  • permanent error
  • reliability
  • transient error
