High-Level Hardware Feature Extraction for GPU Performance Prediction of Stencils

Toomas Remmelg, Bastian Hagedorn, Lu Li, Michel Steuwer, Sergei Gorlatch, Christophe Dubach

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs, which is notoriously hard when targeting GPUs.

Compilers use either hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, but it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, searches the optimization space automatically and adapts to different devices. However, this process is often very time-consuming, as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speed up the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.

Using the Lift framework, this paper demonstrates how low-level, GPU-specific features can be extracted directly from a high-level functional representation. The Lift IR (Intermediate Representation) is a very suitable choice, since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as the number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speed up the exploration of the space by a factor of 2000x on an AMD GPU and 450x on an Nvidia GPU, on average across many stencil applications.
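
To make the "unique cache lines accessed per warp" feature concrete, the following is a minimal illustrative sketch and not the extraction method described in the paper, which works on the Lift IR. It assumes a warp of 32 threads, 4-byte elements and 128-byte cache lines, and it takes a hypothetical function `index_of_thread` that maps a thread ID within the warp to the array index that thread reads. All names and default values here are assumptions chosen for illustration.

```python
def unique_cache_lines_per_warp(index_of_thread, warp_size=32,
                                element_bytes=4, line_bytes=128):
    """Count the distinct cache lines one warp touches for a single access.

    index_of_thread: hypothetical mapping from thread ID (0..warp_size-1)
    to the element index that thread reads. The defaults for warp size,
    element size and cache-line size are assumed, not taken from the paper.
    """
    lines = {
        (index_of_thread(t) * element_bytes) // line_bytes
        for t in range(warp_size)
    }
    return len(lines)


# Coalesced access: consecutive threads read consecutive elements.
coalesced = unique_cache_lines_per_warp(lambda t: t)

# Strided access: consecutive threads read elements 1024 positions apart,
# for example walking down a column of a row-major 1024-wide grid.
strided = unique_cache_lines_per_warp(lambda t: t * 1024)

# Stencil neighbour: the same coalesced pattern shifted by one element,
# which makes the warp straddle a cache-line boundary.
shifted = unique_cache_lines_per_warp(lambda t: t + 1)

print(coalesced, strided, shifted)
```

Under these assumptions the coalesced pattern touches a single cache line, the strided pattern touches one line per thread, and the shifted stencil neighbour touches two. A count of this kind is the sort of feature a performance model could use to tell apart variants with different memory access patterns.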
Original language: English
Title of host publication: GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
Publisher: ACM Association for Computing Machinery
Number of pages: 10
ISBN (Print): 9781450370257
Publication status: Published - 23 Feb 2020
Event: 13th Workshop on General Purpose Processing Using GPU (GPGPU 2020) @ PPoPP 2020 - San Diego, United States
Duration: 23 Feb 2020 to 23 Feb 2020


Workshop: 13th Workshop on General Purpose Processing Using GPU (GPGPU 2020)
Abbreviated title: GPGPU 2020
Country: United States
City: San Diego


  • Performance models
  • GPUs optimization
  • Stencil computation
  • Features extraction

