Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation

Michel Steuwer, Toomas Remmelg, Christophe Dubach

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Graphics Processing Units (GPUs) are used as general-purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices. However, producing high-performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, which has a very different architecture from desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to achieve its performance potential on another type of GPU.

Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics. In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made.

This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.
Original language: English
Title of host publication: CASES '16 Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Publisher: ACM
Number of pages: 10
ISBN (Print): 978-1-4503-4482-1
Publication status: Published - 1 Oct 2016
Event: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems - Pittsburgh, United States
Duration: 2 Oct 2016 to 7 Oct 2016
https://www.esweek.org/archive/index.html

Conference

Conference: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Abbreviated title: CASES '16
Country/Territory: United States
City: Pittsburgh
Period: 2/10/16 to 7/10/16
