Detecting scale-induced overflow bugs in production HPC codes

Justs Zarins*, Michele Weiland, Paul Bartholomew, Leigh Lapworth, Mark Parsons

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Scaling bugs – errors that only manifest at large scale, in terms of the number of parallel workers or the input size – are critical to detect early in the testing of HPC codes. If missed, these bugs can cause applications either to crash during production runs or, even worse, to continue silently and corrupt results. Either outcome wastes vast amounts of resources, and a crash might not provide any useful debugging information. Laguna et al. [1] presented a method that addresses this problem: scale variables are traced statically throughout an application to detect potentially overflowing instructions, and the results are further refined by running a few small-scale experiments. However, their algorithm cannot trace several code patterns found in production HPC applications, for example code modularity, and it has not been applied to Fortran applications. We present an extension to their algorithm that addresses these issues, enabling us to find scaling bugs in complex real applications where they could not be found before. The key features that enable this are backward/forward tracing and optimistic GEP comparison.

[1] Laguna, I., Schulz, M.: Pinpointing scale-dependent integer overflow bugs in large-scale parallel applications. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 216–227. IEEE (2016)
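
As an illustration of the kind of scale-induced overflow described in the abstract, here is a minimal sketch. It is not taken from the paper or from [1], and it is not the detection algorithm itself. It simulates a 32-bit signed multiplication of a scale variable (the rank) by a per-rank element count. The names `wrap_int32`, `global_offset_int32`, `overflows_int32` and the chosen sizes are hypothetical, picked only for this example.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit value (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


def global_offset_int32(rank: int, local_elems: int) -> int:
    """Offset of a rank's first element, as a 32-bit multiply would store it."""
    return wrap_int32(rank * local_elems)


def overflows_int32(rank: int, local_elems: int) -> bool:
    """True if the exact product no longer fits in a signed 32-bit integer."""
    return rank * local_elems > INT32_MAX


# Hypothetical per-rank element count (an assumption for this example).
local_elems = 1_000_000

# Small-scale test run: the product fits, so the bug stays hidden.
small = global_offset_int32(rank=1_000, local_elems=local_elems)

# Production-scale run: the same expression silently wraps to a negative offset.
large = global_offset_int32(rank=4_000, local_elems=local_elems)

print(small, overflows_int32(1_000, local_elems))
print(large, overflows_int32(4_000, local_elems))
```

The expression is identical in both runs; only the value of the scale variable changes. This is why such bugs tend to escape small-scale testing, and why an analysis that tracks which instructions depend on scale variables can flag them before a production run.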
Original language: English
Title of host publication: Lecture Notes in Computer Science
Subtitle of host publication: High Performance Computing. ISC High Performance 2022 International Workshops
Publisher: Springer
Pages: 33–43
Number of pages: 11
Volume: 13387
ISBN (Electronic): 978-3-031-23220-6
ISBN (Print): 978-3-031-23219-0
Publication status: Published - 4 Jan 2023

Keywords / Materials (for Non-textual outputs)

  • Scaling bugs
  • Correctness
  • LLVM
