Performance Evaluation of Adaptive Routing on Dragonfly-based Production Systems

Sudheer Chunduri, Taylor Groves, Kevin Harms, Peter Mendygral, Justs Zarins, Michele Weiland, Yasaman Ghadar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Performance of applications in production environments can be sensitive to network congestion. Cray Aries supports adaptive routing, in which each network packet is routed independently based on the load or congestion it encounters as it traverses the network. Software can select a different routing policy for each posted message, adjusting the bias between minimal and non-minimal routes. We have extensively evaluated how sensitive application performance and whole-system performance are to the routing bias selection, in both production and controlled conditions. We show that the default routing bias used in Aries-based systems is often sub-optimal and that a higher bias towards minimal routes not only reduces the congestion effects on the application but also decreases the overall congestion on the network. This routing scheme not only improves the mean performance of most production applications (by up to 12%) but also reduces run-to-run variability. Our study prompted two supercomputing facilities (ALCF and NERSC) to change the default routing mode on their Aries-based systems. We present the substantial improvement in overall congestion management and interconnect performance measured in production after this change.
Original language: English
Title of host publication: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 340-349
Number of pages: 10
ISBN (Electronic): 978-1-6654-4066-0
ISBN (Print): 978-1-6654-1156-1
Publication status: Published - 28 Jun 2021
Event: 35th IEEE International Parallel and Distributed Processing Symposium - Online, Portland, United States
Duration: 17 May 2021 to 21 May 2021
https://www.ipdps.org/ipdps2021

Publication series

Publisher: IEEE
ISSN (Print): 1530-2075
ISSN (Electronic): 1530-2075

Symposium

Symposium: 35th IEEE International Parallel and Distributed Processing Symposium
Abbreviated title: IPDPS 2021
Country/Territory: United States
City: Portland
Period: 17/05/21 to 21/05/21
