Abstract
Graphics Processing Units (GPUs) are used as general purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices. However, producing high performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, whose architecture is very different from that of desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to reach its performance potential on another type of GPU.

Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics. In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made.

This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.
| Original language | English |
|---|---|
| Title of host publication | CASES '16 Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems |
| Publisher | ACM |
| Number of pages | 10 |
| ISBN (Print) | 978-1-4503-4482-1 |
| Publication status | Published - 1 Oct 2016 |
| Event | Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Pittsburgh, United States. Duration: 2 Oct 2016 → 7 Oct 2016. https://www.esweek.org/archive/index.html |
Conference
| Conference | Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems |
|---|---|
| Abbreviated title | CASES '16 |
| Country/Territory | United States |
| City | Pittsburgh |
| Period | 2/10/16 → 7/10/16 |