Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer, Malte Friese, Sebastian Albers, Sergei Gorlatch

Research output: Contribution to journal (Article, peer-reviewed)


Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide efficient implementations of them, allowing the application developer to focus on the structure of an algorithm rather than on implementation details. This is especially important for modern parallel systems with multiple graphics processing units (GPUs), whose programming is complex and error-prone because state-of-the-art programming approaches such as CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations, which occur in real-world applications ranging from bioinformatics to physics. We develop a generic parallel implementation of the skeleton for multi-GPU systems in OpenCL. To enable automatic use of the fast GPU memory, we identify a particular memory access pattern of the customizing function and implement an optimized version of the allpairs skeleton for customizing functions that follow this pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of the lines of code compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written, optimized OpenCL code.
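As a rough illustration of the semantics summarized above, the following is a minimal sequential sketch, not the authors' OpenCL or multi-GPU implementation. It assumes that allpairs takes a customizing function f and two row collections A and B and produces C with C[i][j] = f(A[i], B[j]). It also assumes, for illustration only, that the "certain memory access pattern" is a customizing function that walks both rows element by element in lockstep (a zip followed by a reduction). The names `allpairs`, `zip_reduce`, `transpose`, and `matmul` are hypothetical.

```python
from typing import Callable, List, Sequence


def allpairs(f: Callable[[Sequence, Sequence], object],
             a: Sequence[Sequence],
             b: Sequence[Sequence]) -> List[List[object]]:
    """Generic allpairs (sketch): apply f to every pair (row i of a, row j of b).

    The result has len(a) rows and len(b) columns, with c[i][j] = f(a[i], b[j]).
    """
    return [[f(row_a, row_b) for row_b in b] for row_a in a]


def zip_reduce(combine: Callable, reduce_op: Callable, init) -> Callable:
    """Build a customizing function of the shape reduce(reduce_op) . zipWith(combine).

    Assumption: functions of this shape read both rows in lockstep, the kind of
    regular access pattern that a GPU version could stage in fast local memory.
    """
    def f(x: Sequence, y: Sequence):
        if len(x) != len(y):
            raise ValueError("rows must have equal length")
        acc = init
        for xi, yi in zip(x, y):
            acc = reduce_op(acc, combine(xi, yi))
        return acc
    return f


def transpose(m: Sequence[Sequence]) -> List[List]:
    """Swap rows and columns so that the columns of m can be passed as rows."""
    return [list(col) for col in zip(*m)]


def matmul(a: Sequence[Sequence], b: Sequence[Sequence]) -> List[List]:
    """Matrix multiplication expressed as allpairs: C = allpairs(dot)(A, B^T)."""
    dot = zip_reduce(lambda x, y: x * y, lambda s, t: s + t, 0)
    return allpairs(dot, a, transpose(b))
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields `[[19, 22], [43, 50]]`. Separating the generic `allpairs` from the structured `zip_reduce` customizing function mirrors the two implementations described in the abstract: any f works with the generic version, while only functions with the lockstep structure would qualify for an optimized one.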
Original language: English
Pages (from-to): 601-618
Number of pages: 18
Journal: International Journal of Parallel Programming
Issue number: 4
Publication status: Published - 24 Aug 2013
