Abstract
Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource-constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs, such as Nvidia's cuDNN or the ARM Compute Library. Such libraries are the basis for higher-level, commonly used machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers, who have to continually play catch-up with rapid hardware and network evolution.
To reduce effort and time to market, new approaches are needed, based on automatic code generation rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms, which are then automatically optimized using a system of rewrite rules. Direct convolution, as opposed to the matrix-multiplication approach commonly used by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to automatically generate code that is 10× faster than the direct convolution while using 3.6× less space than the GEMM-based convolution of the highly specialized ARM Compute Library on the latest generation of ARM Mali GPU.
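To illustrate the memory argument, the sketch below shows a plain direct convolution as a loop nest in C: the kernel reads the input and weights in place and writes the output directly, so, unlike the GEMM (im2col) approach, it never materializes an intermediate buffer that is roughly R×S times larger than the input. This is only a minimal illustrative sketch under assumed layouts (single image, CHW input, KCRS weights, stride 1, no padding), not the Lift-generated kernel or the ARM Compute Library implementation; all names and signatures here are hypothetical.

```c
#include <stddef.h>

/* Direct 2D convolution, stride 1, no padding.
 * in  : C_in x H x W            (input feature maps)
 * wgt : K x C_in x R x S        (K filters)
 * out : K x (H-R+1) x (W-S+1)   (output feature maps)
 * No im2col buffer is created; memory use stays at input + weights + output. */
void direct_conv2d(const float *in, const float *wgt, float *out,
                   size_t C_in, size_t H, size_t W,
                   size_t K, size_t R, size_t S)
{
    size_t H_out = H - R + 1, W_out = W - S + 1;
    for (size_t k = 0; k < K; ++k)                 /* output channel */
        for (size_t y = 0; y < H_out; ++y)         /* output row     */
            for (size_t x = 0; x < W_out; ++x) {   /* output column  */
                float acc = 0.0f;
                for (size_t c = 0; c < C_in; ++c)  /* reduce over input channels */
                    for (size_t r = 0; r < R; ++r)
                        for (size_t s = 0; s < S; ++s)
                            acc += in[(c * H + (y + r)) * W + (x + s)]
                                 * wgt[((k * C_in + c) * R + r) * S + s];
                out[(k * H_out + y) * W_out + x] = acc;
            }
}
```

In the im2col+GEMM formulation, each R×S input patch is copied into a column of an intermediate matrix before the matrix multiply, which is what inflates memory on mobile devices; the loop nest above avoids that copy entirely.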
| Original language | English |
|---|---|
| Title of host publication | GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit |
| Publisher | ACM |
| Pages | 41-50 |
| Number of pages | 10 |
| ISBN (Print) | 9781450370257 |
| DOIs | |
| Publication status | Published - 23 Feb 2020 |
| Event | 13th Workshop on General Purpose Processing Using GPU (GPGPU 2020) @ PPoPP 2020 - San Diego, United States. Duration: 23 Feb 2020 → 23 Feb 2020. https://insight-archlab.github.io/gpgpu.html |
Workshop
| Workshop | 13th Workshop on General Purpose Processing Using GPU (GPGPU 2020) |
|---|---|
| Abbreviated title | GPGPU 2020 |
| Country/Territory | United States |
| City | San Diego |
| Period | 23/02/20 → 23/02/20 |
| Internet address | https://insight-archlab.github.io/gpgpu.html |
Keywords
- code generation
- convolution
- mobile GPU
- parallelism