Performance Portable GPU Code Generation for Matrix Multiplication

Toomas Remmelg, Thibaut Lutz, Michel Steuwer, Christophe Dubach

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device.

Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated.
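To make the idea of a semantics-preserving rewrite rule concrete, here is a minimal sketch in Python. It is not the authors' language or compiler; the names `split`, `join`, `map_flat` and `map_split_join` are hypothetical helpers chosen for illustration. The sketch shows one rule of the general kind such systems rely on: a flat map over an array can be rewritten into a map over fixed-size chunks followed by a flattening step, which changes how the work could be distributed without changing the result.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def split(n: int, xs: List[A]) -> List[List[A]]:
    """Cut xs into consecutive chunks of length n (n must divide len(xs))."""
    if n <= 0 or len(xs) % n != 0:
        raise ValueError("chunk size must be positive and divide the input length")
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def join(xss: List[List[A]]) -> List[A]:
    """Flatten a list of chunks back into a single list."""
    return [x for xs in xss for x in xs]


def map_flat(f: Callable[[A], B], xs: List[A]) -> List[B]:
    """Left-hand side of the illustrative rule: a plain map."""
    return [f(x) for x in xs]


def map_split_join(f: Callable[[A], B], n: int, xs: List[A]) -> List[B]:
    """Right-hand side: split into chunks, map inside each chunk, then join."""
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])


def square(x: int) -> int:
    return x * x


# Both forms compute the same result; only the nesting structure differs.
data = list(range(8))
assert map_flat(square, data) == map_split_join(square, 4, data)
```

In a code generator, the nested form could be mapped onto a hardware hierarchy (for example, one chunk per work-group), whereas the flat form leaves that choice open. That mapping is an assumption of this sketch, not a claim about the paper's exact rules.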

In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized – but provably correct – implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia and even outperforms AMD’s clBLAS library.
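As an illustration of what "differently optimized but equivalent" means for tiling, the following sketch compares a naive matrix multiplication with a tiled loop nest in plain Python. It is a hand-written analogy, not output of the compiler described above; the function names and the `tile` parameter are assumptions made for this example, and real generated kernels would be OpenCL code that also uses local memory and registers.

```python
from typing import List

Matrix = List[List[float]]


def matmul_naive(a: Matrix, b: Matrix) -> Matrix:
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Same computation with every loop blocked into tiles of size `tile`."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile of output rows
        for j0 in range(0, m, tile):      # tile of output columns
            for p0 in range(0, k, tile):  # tile of the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
assert matmul_tiled(a, b, tile=2) == matmul_naive(a, b)
```

The tile size changes only the order in which partial products are visited, which is why many variants (different tile sizes, different loop orders) can all be correct while performing very differently on a given device.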
Original language: English
Title of host publication: GPGPU-9 General-Purpose GPU Workshop
Number of pages: 10
ISBN (Print): 978-1-4503-4195-0
Publication status: Published - 2016
Event: General-Purpose GPU Workshop - Barcelona, Spain
Duration: 12 Mar 2016 to 16 Mar 2016


Conference: General-Purpose GPU Workshop
Abbreviated title: GPGPU-9