State of the art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad-hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach weak or even unviable for applications involving more than a small number of kernel invocations or a highly dynamic control flow, leaving substantial potential opportunities unexplored. It also leads to an over complex, hard to maintain code base. We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter-and intra-kernel optimization. Helium is implemented as preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations, on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains at least the same, and frequently even better performance, than carefully handtuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernel cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand optimized code.
|Title of host publication||GPGPU 2015 Proceedings of the 8th Workshop on General Purpose Processing using GPUs|
|Number of pages||11|
|Publication status||Published - 2015|
FingerprintDive into the research topics of 'Helium: a transparent inter-kernel optimizer for OpenCL'. Together they form a unique fingerprint.
- School of Informatics - Personal Chair of Patterned Parallel Computing
- Institute for Computing Systems Architecture
- Computer Systems
Person: Academic: Research Active