Abstract / Description of output
Scaling bugs (errors that only manifest at large scale, in terms of the number of parallel workers or the input size) are critical to detect early in the testing of HPC codes. If missed, these bugs can cause applications to crash during production runs or, even worse, to continue silently and corrupt results. Either outcome wastes vast amounts of resources, and a crash might not provide any useful debugging information. Laguna et al. [1] presented a method for addressing this problem in which scale variables are traced statically throughout an application and potentially overflowing instructions are detected, with further refinement done by running a few small-scale experiments. However, their algorithm is not able to trace several code patterns found in production HPC applications, for example code modularity, and it has not been applied to Fortran applications. We present an extension to their algorithm which addresses these issues, thus enabling us to find scaling bugs in complex real applications where they could not be found before. The key features that enable this are backward/forward tracing and optimistic GEP (LLVM getelementptr) comparison.
[1] Laguna, I., Schulz, M.: Pinpointing scale-dependent integer overflow bugs in large-scale parallel applications. In: SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 216–227. IEEE (2016)
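To make the class of bug described in the abstract concrete, the following is a minimal sketch in C of a scale-dependent integer overflow. It is not the algorithm of [1] or its extension; the function names `global_offset` and `offset_overflows_int32` and the parameters `rank` and `elems_per_rank` are hypothetical, chosen only for illustration. The idea is that an offset computed as `rank * elems_per_rank` in a 32-bit integer is correct in small test runs but exceeds the 32-bit range once the number of workers grows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: global offset of the first element owned by `rank`,
 * computed in 64-bit arithmetic so the result is exact for any pair of
 * non-negative 32-bit inputs. */
static int64_t global_offset(int32_t rank, int32_t elems_per_rank) {
    assert(rank >= 0 && elems_per_rank >= 0);
    return (int64_t)rank * (int64_t)elems_per_rank;
}

/* Returns true if the same offset does not fit in a 32-bit signed integer,
 * i.e. the naive `int32_t offset = rank * elems_per_rank;` would overflow
 * at this scale. */
static bool offset_overflows_int32(int32_t rank, int32_t elems_per_rank) {
    return global_offset(rank, elems_per_rank) > INT32_MAX;
}
```

With one million elements per rank, for example, rank 1023 still yields an offset that fits in 32 bits, whereas rank 4095 does not, so a test campaign limited to about a thousand workers would never exercise the faulty case.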
Original language | English |
---|---|
Title of host publication | Lecture Notes in Computer Science |
Subtitle of host publication | High Performance Computing. ISC High Performance 2022 International Workshops |
Publisher | Springer |
Pages | 33-43 |
Number of pages | 11 |
Volume | 13387 |
ISBN (Electronic) | 978-3-031-23220-6 |
ISBN (Print) | 978-3-031-23219-0 |
Publication status | Published - 4 Jan 2023 |
Keywords / Materials (for Non-textual outputs)
- Scaling bugs
- Correctness
- LLVM