Full-System Simulation of Mobile CPU/GPU Platforms

Kuba Kaszyk*, Harry Wagstaff*, Tom Spink*, Björn Franke*, Mike O’Boyle*, Bruno Bodin† and Henrik Uhrenholt‡

*School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, Email: see https://www.ed.ac.uk/informatics/people
† Yale-NUS College, School of Computing, National University of Singapore. Email: bruno.bodin@yale-nus.edu.sg
‡Arm Sweden, Lund, Sweden Email: Henrik.Uhrenholt@arm.com

Abstract—Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and user-space drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU/GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali-G71 GPU powered device. We validate our simulator against a hardware implementation and ARM’s stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced Computer Vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We demonstrate that performance optimizations for desktop GPUs trigger bottlenecks on mobile GPUs, and show the importance of efficient memory use.

Index Terms—Computer simulation

I. INTRODUCTION

GPU simulation is central to driving GPU research and development. It is used for early design space exploration and architecture tuning [1]–[3], evaluation of GPU compilation techniques [4], application development and optimization [5], and in virtual platforms for system software development [6]. While Central Processing Unit (CPU) simulation techniques have reached maturity, GPU simulation often suffers from the following problems: (a) instruction sets are not accurately modeled, but approximated by an artificial, low-level intermediate representation [7], [8], (b) GPU simulators do not model existing commercial GPUs, but only simplified GPU architectures [9], (c) instead of using vendor provided driver stacks and compilers, GPU simulators often rely on simplified system software, which may behave entirely differently to original tools [10], [11], and (d) GPUs are treated as standalone devices, not modeling any CPU-GPU transactions [12]. This has led researchers using GPU simulation to rely on tools providing questionable accuracy [13].

For example, many existing GPU simulators including gem5-GPU [14], GPUTejas [15], Multi2Sim [10], GPGPU-Sim [16], and Multi2Sim-Kepler [11] claim cycle-accuracy. However, despite this claim, all of these simulators show significant differences in the reported cycle count (or other reported performance metrics) compared to actual hardware. In extreme cases, these errors exceed 100%. Other GPU simulators have either not been evaluated against hardware reference platforms or do not attempt to model any available GPU. Available instruction-accurate simulators, i.e., those without a cycle-level timing model, also show significant errors. For example, Barra [17] exhibits up to 81.6% difference in instruction counts compared to those reported by measurements on a real GPU.

Further error is introduced by the use of outdated or non-standard GPU tool chains required by several simulators. We compiled a set of OpenCL kernels with different versions of the vendor supplied compiler¹ (v5.6, 5.7, 6.0, 6.1, 6.2) for the Arm Mali-G71. Fig. 1 shows that we observed major differences between compiler versions, e.g., GPU arithmetic cycles in the selected kernel differ by 47% (6.0 to 6.1). It is more than likely that simplified or non-vendor supplied tool chains used by other GPU simulators introduce even greater error, as also highlighted in [13].

In this paper we claim that without a truly accurate GPU simulation model and a full-system environment capable of running an unmodified GPU software stack and applications it is not possible to gather reliable performance metrics to underpin GPU architecture research.

A. Full-System GPU Simulation

In this paper, we propose a fundamentally different approach to GPU simulation, avoiding the aforementioned issues. The goal of this work is to accurately simulate a state-of-the-art mobile GPU in a full-system context, enabling the use of unmodified vendor-supplied drivers and JIT compilers, operating in an unmodified target operating system, and executing unmodified applications. This requires our GPU simulator to be architecturally identical to a physical GPU, and model all components of the application and system software stack.

We focus on functional CPU/GPU simulation, i.e. without detailed timing information. While this method sacrifices cycle-accuracy, it enables us to improve simulation performance to a level where it is feasible to run complex CPU/GPU workloads. Such a functional simulator is also a prerequisite to detailed timing simulation and can still provide useful execution statistics, such as instruction counts, memory traces, and CPU-GPU transaction details. At the same time, our system guarantees optimal GPU feature support, and ensures that our virtual platform executes identical code to that on physical hardware. Our fast simulation approach also supports interactive workloads, and new Application Programming Interfaces (APIs) (e.g. Vulkan) without additional engineering.

Notable use cases for our full-system CPU/GPU simulation technology are (1) early GPU design space exploration, where a GPU currently under design can be evaluated and (2) virtual platforms for both system- and user-level software development, both without producing a physical version. These use cases benefit particularly from the accuracy and performance that our integrated CPU/GPU simulation approach offers.

B. State-of-the-Art

In order to further motivate our full-system approach to CPU/GPU simulation, we initially review the most popular GPU simulators: GPGPU-Sim [16] and Multi2Sim [10], [11]. In Fig. 2 we compare the GPU kernel execution and software stack for (a) a native execution environment, (b) our full-system simulator, (c) Multi2Sim, and (d) GPGPU-Sim. For Multi2Sim and GPGPU-Sim we have highlighted non-standard software components that are different from the vendor-supplied driver stack and thus represent a source of inaccuracy.

Fig. 2: Comparison of the GPU kernel execution model and software stack for (a) a native execution environment, (b) our full-system simulator, (c) Multi2Sim, and (d) GPGPU-Sim. For Multi2Sim and GPGPU-Sim we have highlighted non-standard software components that are different from the vendor-supplied driver stack and thus represent a source of inaccuracy.

However, GPGPU-Sim requires its own runtime libraries and device drivers, which (a) differ substantially from the vendor supplied libraries, (b) are not feature complete, and (c) introduce significant accuracy problems.

C. Contributions

In this paper we develop a full-system simulation environment for a mobile platform, enabling users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali-G71 GPU powered device. We validate our simulator against a hardware implementation as well as Arm’s stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced Computer Vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We then make a direct comparison against desktop GPUs, and show that memory usage is hugely significant to mobile GPU performance.

II. BACKGROUND: ARM BIFROST GPU

Here, we provide a brief overview of the Arm Bifrost GPU architecture. This is a state-of-the-art mobile GPU design powering many high-end smartphone Systems-on-Chip (SoCs). In our simulator we implement the Arm Mali-G71 GPU, found in e.g. the Exynos 8895 SoC that powers the Samsung Galaxy S8. Fig. 3a gives an overview of the Bifrost architecture.²

The architecture features up to 32 unified Shader Cores (SCs), and a single logical L2 GPU cache that is split into several fully coherent physical cache segments. Full system coherency support and shared main memory tightly couple the GPU and CPU memory systems. For this, Bifrost features a built-in Memory Management Unit (MMU) supporting AArch64 and LPAE address modes. A central Job Manager (JM) interacts with the driver stack and orchestrates GPU jobs.

A. Shader Cores

SCs (see Fig. 3b) are blocks consisting of Execution Engines (EEs), three in the Mali-G71, and a number of data processing units, linked by a messaging fabric.

The EEs are responsible for executing the programmable shader instructions, each including an arithmetic processing pipeline as well as all of the required thread state.

The arithmetic units implement a vectorization scheme to improve functional unit utilization. Threads are grouped into bundles of four (a “quad”), which fill the width of a 128-bit data processing unit. From the viewpoint of a single thread, the architecture behaves as a stream of scalar 32-bit operations.

Instructions are bundled into clauses of up to 8 tuples (16 instructions), as shown in Fig. 4a. Within a clause, instructions can access temporary registers, reducing pressure on the global register file (see Fig. 4b). Further details can be found in [18].
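The quad and clause limits described above can be captured in a small data model. The following is a minimal sketch, not the simulator's actual representation: the class names, the two-slots-per-tuple layout (inferred from 8 tuples carrying 16 instructions) and the mnemonics used in any example are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

QUAD_WIDTH = 4        # threads per quad (from the text)
LANE_BITS = 32        # each thread sees scalar 32-bit operations
MAX_TUPLES = 8        # a clause holds at most 8 tuples
SLOTS_PER_TUPLE = 2   # assumption: 8 tuples carry 16 instructions, two slots each


@dataclass
class InstrTuple:
    # Hypothetical encoding: an opcode mnemonic per slot, None for an empty slot.
    slot0: Optional[str] = None
    slot1: Optional[str] = None

    def instructions(self) -> List[str]:
        return [s for s in (self.slot0, self.slot1) if s is not None]


@dataclass
class Clause:
    tuples: List[InstrTuple] = field(default_factory=list)

    def append(self, t: InstrTuple) -> None:
        if len(self.tuples) >= MAX_TUPLES:
            raise ValueError("a clause holds at most 8 tuples")
        self.tuples.append(t)

    def instruction_count(self) -> int:
        return sum(len(t.instructions()) for t in self.tuples)

    def empty_slots(self) -> int:
        return len(self.tuples) * SLOTS_PER_TUPLE - self.instruction_count()


def datapath_bits() -> int:
    """Width filled by one quad: four 32-bit lanes."""
    return QUAD_WIDTH * LANE_BITS
```

A full clause built from eight two-instruction tuples reports 16 instructions and no empty slots, while a tuple with only one slot filled contributes one empty slot.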

²Figures 3 and 4a reproduced with kind permission of Arm Ltd.

III. OUR SIMULATION APPROACH

Our simulation environment, as shown in Fig. 5, provides a full-system view of a CPU/GPU platform. Such an approach also requires additional components to be emulated including an MMU, interrupt controller, timer devices, storage and network devices. In order to benefit from existing device drivers, we model the Arm Versatile Express and Juno platforms, each augmented with an Arm Mali-G71 GPU.

Both the simulated CPU and GPU are modeled using high-level architecture descriptions [19], and generated using a retargetable simulation framework [20], which also supports other architectures. They each run in separate threads on the host CPU, providing concurrent and asynchronous operation.

A. CPU Simulation

We simulate the CPU through full-system Dynamic Binary Translation (DBT) (similar to QEMU [21]), which boots a Linux Arm kernel and user space from a file system mounted by the simulated storage device. For complete and accurate modeling, we simulate essential platform devices, ensuring that our simulator can support a full software stack without simulation-specific adaptation of any software component.

B. GPU Simulation

We generate an interpretive GPU simulation module for the programmable GPU SCs from the Mali architecture description, while non-core components are implemented directly.

1) CPU-GPU Interface: The GPU interfaces with the CPU via memory mapped registers, hardware interrupts, and memory, through which the simulated GPU exposes its JM to the CPU. For GPU compute jobs, the OpenCL driver sets up shader programs in the shared CPU-GPU memory space, and then triggers an interrupt on the GPU by writing to a control register, indicating that a job is ready for execution. These interrupts are visible to the JM, which begins execution.
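The register-write handshake can be illustrated with a toy device model. This is a sketch under stated assumptions only: the register offset `JOB_SUBMIT_REG`, the class `GpuDevice` and the convention that the written value is a job descriptor address are hypothetical and are not taken from the Mali register map.

```python
import queue
import threading

JOB_SUBMIT_REG = 0x18  # hypothetical offset, not from Mali documentation


class GpuDevice:
    """Toy memory-mapped device exposing a job queue to the CPU side."""

    def __init__(self):
        self.regs = {}
        self.pending_jobs = queue.Queue()

    def mmio_write(self, offset, value):
        self.regs[offset] = value
        if offset == JOB_SUBMIT_REG:
            # Assumption: the value is the guest address of a job descriptor.
            self.pending_jobs.put(value)

    def mmio_read(self, offset):
        return self.regs.get(offset, 0)


def job_manager(gpu, executed):
    """Runs in its own host thread; None is used here as a shutdown marker."""
    while True:
        descriptor = gpu.pending_jobs.get()
        if descriptor is None:
            break
        executed.append(descriptor)  # a real JM would parse and run the job
```

In this sketch the CPU simulation thread calls `mmio_write` whenever the guest driver stores to the device's address range, and a separate host thread runs `job_manager`, mirroring the asynchronous CPU and GPU threads described earlier.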

2) Shader Core Simulation: The generated simulator code comprises the instruction decoder and main EEs of the GPU. The interpretive execution model is split into two phases: (1) decode, and (2) execution. During phase one, the shader program and its associated metadata are decoded for later use. In phase two, a dispatcher iterates over the job dimensions and creates simulated GPU threads. These threads are grouped into “warps”, where all threads execute in lockstep. Warps are in turn grouped into thread-groups, i.e. OpenCL workgroups.
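The grouping performed by the dispatcher can be sketched as follows. This is an illustrative approximation: the function name `dispatch`, the warp size of four (matching the quad width from Section II) and the enumeration order of thread IDs are assumptions, not details confirmed by the text.

```python
from itertools import product

WARP_SIZE = 4  # assumption: one warp corresponds to one quad of four threads


def dispatch(global_size, local_size):
    """Yield (group_id, warps) for every thread-group of a job.

    Each warp is a list of local thread IDs that would execute in lockstep.
    """
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of the local size")
    groups_per_dim = [g // l for g, l in zip(global_size, local_size)]
    local_ids = list(product(*(range(l) for l in local_size)))
    for group_id in product(*(range(n) for n in groups_per_dim)):
        warps = [local_ids[i:i + WARP_SIZE]
                 for i in range(0, len(local_ids), WARP_SIZE)]
        yield group_id, warps
```

For a job of 8x4 threads with 4x2 thread-groups, this produces four thread-groups of two warps each.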

3) Performance Optimizations: As described above, the simulation is split into a decode phase and an execution phase. During the decode phase, the simulator extensively caches decoded guest code, which is then accessed during the execution phase. This model ensures that the entire shader program is decoded exactly once.
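The decode-once property can be sketched with a cache keyed by guest address. The function names and the idea of keying by clause address are assumptions for illustration; the text does not specify how the cache is organized.

```python
def decode_program(raw_clauses, decode_fn):
    """Phase one: decode every clause once, keyed by its guest address."""
    return {addr: decode_fn(raw) for addr, raw in raw_clauses.items()}


def run(decoded, pc_trace, execute_fn):
    """Phase two: execution only looks up already-decoded clauses."""
    for pc in pc_trace:
        execute_fn(decoded[pc])
```

However often a clause is executed, for example inside a loop or by many threads, `decode_fn` is invoked only once per clause.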

In hardware, each SC executes one thread-group at a time. In our simulator, however, the number of SCs and host threads is individually configurable. For example, instead of mapping 8 SCs onto 8 host threads, we can map the executing thread-groups onto 32 host threads, creating virtual cores.

This necessitates additional measures for managing local storage. The GPU driver allocates local storage for 8 thread-groups corresponding to the 8 detected SCs. To support more thread-groups executing in parallel, the simulator allocates additional local memory for each host thread, outwith the guest system. Local guest memory accesses are intercepted and mapped to host memory, guaranteeing functional correctness.
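One way to picture the interception of local memory accesses is a mapper that gives each host thread its own backing buffer for the same guest address window. This is a sketch only: the class `LocalMemoryMapper`, its interface and the use of thread-local storage are assumptions, not the simulator's implementation.

```python
import threading


class LocalMemoryMapper:
    """Redirects accesses to a guest local-memory window to per-host-thread storage."""

    def __init__(self, guest_base, size):
        self.guest_base = guest_base
        self.size = size
        self._tls = threading.local()

    def _backing(self):
        # Lazily allocate one host buffer per host thread, outside guest memory.
        buf = getattr(self._tls, "buf", None)
        if buf is None:
            buf = bytearray(self.size)
            self._tls.buf = buf
        return buf

    def _offset(self, addr, length):
        off = addr - self.guest_base
        if off < 0 or off + length > self.size:
            raise ValueError("access outside the local memory window")
        return off

    def store(self, addr, data):
        off = self._offset(addr, len(data))
        self._backing()[off:off + len(data)] = data

    def load(self, addr, length):
        off = self._offset(addr, length)
        return bytes(self._backing()[off:off + length])
```

Two host threads that write to the same guest local address each read back their own value, so thread-groups simulated in parallel do not interfere even when there are more host threads than SCs detected by the driver.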

4) Job Manager Simulation: In our GPU simulator the JM operates in its own host simulation thread. It fully implements the functionality of its hardware counterpart such as parsing job descriptors and orchestrating the operation of the SCs.

5) Memory Management Unit Simulation: Our simulator incorporates a complete software implementation of the GPU’s MMU. The driver provides the MMU with page table pointers, and the MMU reports errors (permissions violations, faults) to the driver through memory mapped registers and interrupts.

IV. INSTRUMENTATION

Through instrumentation the simulator provides useful statistics, without the overhead of a cycle-accurate simulator:

A. Program Execution

We gather instruction counts and breakdowns, data accesses, and clause information, i.e. statistics directly relating to the executing instructions. From these we can directly see the code size, the ratio of memory to arithmetic instructions, and the types of memory accesses, all of which are vital to understanding the performance implications of the executed code. Each clause is instrumented with detailed metrics at decode time, and during execution we record clause frequency. If executing with multiple host threads, these counts are gathered separately by each parallel unit. Metrics are tallied at job completion, requiring no further synchronization.
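The split between decode-time metrics and execution-time clause frequencies can be sketched as below. The metric names and the dictionary layout are hypothetical; only the scheme (static per-clause metrics, per-thread frequency counters, a single tally at job completion) follows the description above.

```python
from collections import Counter


def count_clauses(clause_trace):
    """Executed by one host thread: record how often each clause ran."""
    freq = Counter()
    for clause_id in clause_trace:
        freq[clause_id] += 1
    return freq


def tally(per_thread_counters, static_metrics):
    """At job completion: merge thread-private counters and scale the
    decode-time metrics of each clause by its execution frequency."""
    total_freq = Counter()
    for counter in per_thread_counters:
        total_freq.update(counter)
    totals = Counter()
    for clause_id, n in total_freq.items():
        for metric, value in static_metrics[clause_id].items():
            totals[metric] += n * value
    return totals
```

Because each host thread only updates its own counter, no locking is needed on the execution path; the only shared step is the final merge.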

B. System

The GPU operates as an accelerator; it is therefore vital to understand its interaction with the rest of the system. The number of pages accessed by the GPU shows the interaction with the memory system and MMU, which is expensive in terms of performance. Interrupts and system register accesses describe the communication with the CPU, which is also a bottleneck.
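Counting the pages touched by the GPU reduces to collecting distinct page numbers from the memory access stream. The sketch below assumes 4 KiB pages and an access trace of (address, size) pairs; both are illustrative assumptions rather than properties stated in the text.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pages_touched(accesses):
    """Return the set of page numbers covered by (address, size) accesses."""
    pages = set()
    for addr, size in accesses:
        first = addr >> PAGE_SHIFT
        last = (addr + size - 1) >> PAGE_SHIFT
        pages.update(range(first, last + 1))
    return pages
```

An access that straddles a page boundary counts towards both pages, which is why the last byte of each access is checked as well as the first.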

C. Control Flow

Control flow instrumentation monitors thread divergence, which occurs when threads within a warp take different paths after a conditional branch. This is a serious performance problem: if a thread diverges, the other threads in the warp must wait for it to reconverge, without scheduling other work. We monitor this by tracking the PC on clause boundaries, and building a Control Flow Graph (CFG). This CFG shows which thread executes which path, and identifies diverging threads at their divergence point, as shown in Fig. 6.
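Divergence detection from per-thread PC traces can be sketched as follows. This is a simplification under stated assumptions: the trace format (one list of clause-boundary PCs per thread) and the function name are hypothetical, and only the first divergence point of a warp is reported, whereas the simulator builds a full CFG.

```python
from collections import defaultdict


def first_divergence(warp_traces):
    """warp_traces maps thread ID -> list of PCs recorded at clause boundaries.

    Returns (pc, {next_pc: [thread IDs]}) for the first clause after which
    threads of the warp continue at different PCs, or None if they never do.
    """
    traces = list(warp_traces.items())
    if len({trace[0] for _, trace in traces}) != 1:
        raise ValueError("all threads of a warp must start at the same PC")
    steps = min(len(trace) for _, trace in traces)
    for i in range(steps - 1):
        by_next = defaultdict(list)
        for tid, trace in traces:
            by_next[trace[i + 1]].append(tid)
        if len(by_next) > 1:
            return traces[0][1][i], dict(by_next)
    return None
```

The returned mapping names both the divergence point and the threads on each outgoing path, which is the information annotated on the CFG edges.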

Fig. 6: BFS: Our simulator generates a control flow graph pinpointing the divergence on actual GPU instructions.

TABLE I: System configurations for performance evaluation.

<table>
<thead>
<tr>
<th>Simulated Platform</th>
<th>Evaluation Platform</th>
<th>Host Platform 1 (main experiments)</th>
<th>Host Platform 2 (Parallel scaling, Fig. 10)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arm-v7A/v8A CPU</td>
<td>HiKEY960 - Arm-v8A CPU</td>
<td>Intel(R) Core(TM) i7-4710MQ CPU</td>
<td>Intel(R) Xeon(R) CPU L7555</td>
</tr>
<tr>
<td>Arm Mali Bifrost GPU - G71, 8 Cores</td>
<td>Arm Mali Bifrost GPU - G71, 8 Cores</td>
<td>4 cores with HT, 2.50GHz</td>
<td>32 cores with HT, 1.87GHz</td>
</tr>
<tr>
<td>Arch Linux (Kernel 4.8.8)</td>
<td>Android-O/Debian Linux</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Arm Mali Bifrost DDK r3p0/r9p0</td>
<td>Arm Mali Bifrost DDK r3p0/r9p0</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

TABLE II: Benchmarks and data set sizes.

<table>
<thead>
<tr>
<th>Suite</th>
<th>Benchmark</th>
<th>Input Type &amp; Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rodinia 3.1</td>
<td>Back Propagation</td>
<td>65536 nodes</td>
</tr>
<tr>
<td>Parboil</td>
<td>Breadth First Search</td>
<td>1257001 nodes</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Binary Search</td>
<td>16777216 elements</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Binomial Option</td>
<td>512 samples</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Bitonic Sort</td>
<td>2048 elements</td>
</tr>
<tr>
<td>Parboil</td>
<td>Cutoff-limited Coulombic Potential (cutcp)</td>
<td>67 atoms</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>DCT</td>
<td>10000x1000 matrix</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>DwtHaar1D</td>
<td>8388608 signal</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Floyd Warshall</td>
<td>256 nodes</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Matrix Transpose</td>
<td>3008x3008 matrix</td>
</tr>
<tr>
<td>Rodinia 3.1</td>
<td>Nearest Neighbor</td>
<td>5 records, 50 latitude, 90 longitude</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Recursive Gaussian</td>
<td>1536x1536 image</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Reduction</td>
<td>9999960 elements</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>Scan Large Arrays</td>
<td>1048576 elements</td>
</tr>
<tr>
<td>Parboil</td>
<td>SGEMM</td>
<td>128x96, 96x160 matrices</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>SobelFilter</td>
<td>1536x1536 image</td>
</tr>
<tr>
<td>Parboil</td>
<td>Sparse Matrix Vector Mult.</td>
<td>1138x1138x2596 matrix</td>
</tr>
<tr>
<td>Parboil</td>
<td>Stencil</td>
<td>128x128x32 matrix, 100 iterations</td>
</tr>
<tr>
<td>AMD APP 2.5</td>
<td>URNG</td>
<td>1536x1536 image</td>
</tr>
<tr>
<td>cblas</td>
<td>SGEMM</td>
<td>1024x1024 matrix</td>
</tr>
</tbody>
</table>

V. EVALUATION

First, we present the validation strategy for our simulator against Arm hardware and a proprietary simulator, achieving 100% architectural accuracy across all available toolchains. We then compare our simulator’s performance and effectiveness against Multi2Sim 5.0. Unless explicitly stated, all comparisons against Multi2Sim use Multi2Sim’s functional simulation mode. Finally, we demonstrate the versatility of our simulator through a series of use cases. Our evaluation focuses on the widely accepted OpenCL compute API, which allows for direct comparison with other GPU simulators.

Details of our host and guest platforms are provided in Table I. As different benchmarks scale in different ways, the default host configuration used 8 threads for GPU simulation. We show additional results for selected benchmarks.

We chose kernels from a variety of benchmark suites. First, we include AMD APP SDK 2.5, as the pre-compiled GPU binaries packaged with Multi2Sim enable direct comparison. AMD driver 2.5, on which Multi2Sim and its compiler Multi2C rely, is no longer available, and code compiled using newer versions often contains features unsupported by Multi2Sim. We also report results for the Parboil [22] and Rodinia [23] benchmark suites, which provide larger, more complex workloads. The benchmarks and inputs are presented in Table II.

Next, we consider a robotic vision application, SLAMBench, demonstrating a concrete use-case for our approach. We show that our simulated metrics relate directly to hardware runtimes. Finally, we demonstrate how optimizations for desktop GPUs trigger bottlenecks on embedded GPUs.

A. Validation and Accuracy

Correctness of our full-system simulation approach has been established by comparison against the commercially available HiKey 960 with a Mali-G71 MP8 GPU. We also validated the GPU part of our simulator against a detailed proprietary simulator for the target GPU architecture. Our comparisons have shown complete accuracy for the evaluated benchmarks, for all evaluated metrics. This is possible only because our simulation is driven by the exact binary that is executed in hardware, thanks to the full support of a native software stack.

1) Comparison to Hardware: Validation has focused on (a) correctness of OpenCL kernel execution on the GPU, evaluated through extensive testing, and (b) correctness of performance metrics, including instruction counts, instruction breakdowns, clause sizes, data access breakdowns, and divergence, for which we compare results from our instrumented simulator to hardware performance counters on the HiKey 960.

2) Comparison to Reference Simulator: We have also validated our simulator against a proprietary, detailed standalone GPU simulator. We executed selected kernels on both simulators using an instruction tracing mode, where individual instructions and their effects are observable. Additionally, we employed fuzzing techniques for rigorous instruction testing, covering an extensive range of inputs.
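The fuzzing-based comparison can be pictured as differential testing of one instruction's semantics against a reference. The sketch below is purely illustrative: a 32-bit wrapping add stands in for a GPU instruction, and the function names, boundary cases and trial count are assumptions rather than the validation harness used here.

```python
import random

MASK32 = 0xFFFFFFFF


def reference_add(a, b):
    """Stand-in reference semantics: 32-bit wrapping addition."""
    return (a + b) & MASK32


def first_mismatch(impl, reference, trials=1000, seed=0):
    """Return the first operand pair on which impl and reference disagree."""
    rng = random.Random(seed)
    # Boundary values first, then random 32-bit operands.
    cases = [(0, 0), (MASK32, 1), (MASK32, MASK32)]
    cases += [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(trials)]
    for a, b in cases:
        if impl(a, b) != reference(a, b):
            return (a, b)
    return None
```

An implementation that forgets to wrap on overflow is caught by the boundary case `(0xFFFFFFFF, 1)`, while a correct implementation yields `None`.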

B. Simulation Performance

Next we evaluate three key simulation performance metrics.

1) GPU OpenCL Simulation Speed: Fig. 8 presents execution performance of our GPU simulator relative to Multi2Sim, where most benchmarks exhibit similar performance levels. Exceptions are BinarySearch and SobelFilter, where our simulator is up to 10x slower than Multi2Sim, and sgemm, where our simulator is 8.8x faster. While this disparity is due to implementation differences between the simulators and simulated architectures, the results demonstrate that accurate full-system simulation of a GPU platform is feasible and yields competitive performance. Fig. 7 shows simulation slowdown over native execution. The average slowdown is 4561x.

Full instrumentation of the GPU simulation generally adds <5% overhead, due to the approach described in Section IV. This means that we provide useful statistics, with performance similar to Multi2Sim’s, which by default only reports instruction breakdown and job dimensions. In cycle-accurate mode, Multi2Sim reports additional statistics, including active execution units, compute unit occupancy, and stream core utilization. However, in our tests it failed to complete the majority of workloads, due to large inputs. On smaller workloads, we observe slowdowns of up to 10x over functional simulation.

2) CPU OpenCL Driver Simulation Speed: Full-system GPU simulation, executing the full software stack on the CPU, adds substantial stress to CPU simulation. Fig. 9 shows software stack runtimes for Sobel Filter with different input sizes. While Multi2Sim spends >150s on CPU-side execution for the largest tested input, our JIT-based CPU simulator executes the entire stack in <10s, resulting in better performance, while maintaining complete accuracy. Overall, Fig. 7 shows that slowdown for the entire system over native hardware is low, averaging only 223x.

3) Simulation Performance Optimizations: In Fig. 10 we evaluate the performance optimization introduced in Section III-B, mapping GPU SCs onto multiple host threads. In the worst case, BinarySearch is iterative, with short kernels executing with heavy CPU interaction, limiting improvement. For SobelFilter, the best case, large thread-group sizes executed for a single kernel enable efficient parallel execution, resulting in steady speedup as host threads are added.

C. Application Results

We first focus on architectural features of Bifrost which would be useful in early design space exploration.

Fig. 7: Simulation slowdown relative to the HiKey960 for GPU only, and for the entire benchmark (CPU+GPU).

Fig. 8: Our simulator’s speed with and without instrumentation, relative to Multi2Sim functional simulation (=1.0).

Fig. 9: The software stack executing on our DBT CPU simulator scales exceptionally well relative to Multi2Sim.

Fig. 10: Increasing the number of host simulation threads yields vast performance improvements for certain benchmarks.

1) Identifying Empty Instruction Slots: Fig. 11 shows instruction mixes for OpenCL benchmarks. For example, SobelFilter is a compute-intensive filter with very few empty slots and memory accesses and almost no control flow. In contrast, the number of empty slots in Reduction and ScanLargeArrays indicates low GPU utilization. On average, 50% of instructions are arithmetic operations, while local memory and control flow each contribute around 10%. Performance can be substantially improved by reducing the number of empty instruction slots introduced by the OpenCL toolchain.

2) Moving Data Closer to the Core: Different types of data storage have different access latencies, which when poorly utilized can lead to severe drops in performance. Ideally, data should be kept as close to the GPU’s execution cores as possible. Our simulator shows exact data placement throughout the hierarchy, and can be used to guide optimization.

Data breakdowns are shown in Fig. 12. SobelFilter exhibits few main memory accesses, while the figures for backprop suggest that it could benefit from enhancements to the OpenCL compiler, more registers, or a better algorithm. Fast accesses to temporary values, constants and ROM dominate. More reads from than writes to global registers suggest effective reuse of register data. Global memory accesses account for <10% of accesses, except for a single case, backprop.

3) Evaluating the Bifrost Clause Model: Clauses contain up to 8 instruction words (16 instructions), which execute unconditionally. Longer clauses are preferable: they reduce global register file accesses through temporary register use and limit the scope for control flow and thread divergence.

Fig. 13 shows the distribution of clause sizes for all benchmarks. Several, including BinomialOption and FloydWarshall, exhibit a majority of clauses of size 1 or 2, and occasionally size 8. Others peak at mid-size clauses, e.g. BitonicSort, or are bimodally distributed, e.g. RecursiveGaussian. Compare this to the instruction mix in Fig. 11, where e.g. RecursiveGaussian features a larger fraction of arithmetic instructions and few empty slots, whereas Reduction is reversed. Overall, kernels with larger clauses feature fewer empty slots, while short clauses and empty slots show some correlation.

Potentially, some kernels perform little work between control flow operations, or the compiler is unable to make use of available slots. Benchmarks with shorter clauses also display a large proportion of memory accesses, suggesting that memory bottlenecks limit the potential of the clause model. The model might suit graphics workloads, as they benefit from additional data processing units and exhibit regular behaviour; however, revisiting the model for compute might be worthwhile.

D. System Level Results

CPU-GPU communication can account for as much as 76% of execution time [23]. In our full-system environment, we are able to gather system-level statistics unavailable to other GPU simulators or hardware. Our approach provides the capability to observe CPU-GPU interactions, allowing us to monitor memory usage, interrupts, and control register accesses, presented in Table III for selected benchmarks. While SobelFilter exhibits little CPU-GPU interaction, BFS touches more pages, and involves a higher number of transactions.

Page use differs by up to three orders of magnitude across benchmarks, with stencil and BFS dominating this metric. BFS is particularly heavy on control interactions showing an unusually high number of control register accesses and interrupts resulting from over 1000 individual compute jobs.

E. Optimizing OpenCL Applications

1) SLAMBench: We demonstrate the capabilities of our full-system simulator by evaluating the OpenCL SLAMBench [24] computer vision application, which comprises several compute kernels and dataflow orchestrated by the CPU.
In its full configuration, SLAMBench executes 40000 kernels, impossible to simulate with existing GPU simulators out-of-the-box, due to their limitation to single kernels, tool chain incompatibilities or lack of support for CPU-GPU interactions.

Table III: System statistics detail the CPU-GPU interaction.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>BFS</td>
<td>51723</td>
<td>308098</td>
<td>66209</td>
<td>8022</td>
<td>1003</td>
</tr>
<tr>
<td>Binomial Option</td>
<td>31</td>
<td>136</td>
<td>70</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>SobelFilter</td>
<td>4609</td>
<td>136</td>
<td>70</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>Stencil</td>
<td>99605</td>
<td>14795</td>
<td>1982</td>
<td>105</td>
<td>100</td>
</tr>
</tbody>
</table>


Fig. 14: Simulated SLAMBench statistics directly relate to HW performance, aiding the search for optimal configurations.

VI. RELATED WORK

To ease comparison to other GPU simulation approaches we provide an overview of features in Table IV, including each simulator’s maximum relative error as reported in their original publications. In some cases, accuracy has not been evaluated or the simulated GPU does not model an existing GPU.

Table IV: Comparison of GPU simulation approaches.

Cycle-accurate full-system CPU/GPU simulation framework, similar to our own; however, the OS kernel driver is emulated, i.e. all driver calls are intercepted by the emulated driver. Similarly to Multi2Sim, any changes to the driver stack need to be implemented directly in gem5, whereas our framework supports any new drivers out of the box. Additionally, in [13] the runtime system executes natively, and only the GPU executes in the simulator. This makes it impossible to simulate systems in which the target GPU architecture differs from the host; for example, one could not simulate an Arm CPU + GPU on an x86 host. While their functional simulation is completely accurate, the cycle-accurate simulator exhibits an average error of 42%. Instruction-accurate simulation, such as our own, is a prerequisite for cycle-accurate simulation, and provides the basis for building a cycle-accurate full-system simulator, targeting any architecture. Instead, we focus solely on fast, functional simulation, enabling execution of realistic, high-intensity workloads such as SLAMBench, allowing for realistic modelling of the full system and software stack.

In 2007, Fung et al. developed a cycle-level simulator (GPGPU-Sim) for an NVIDIA-like GPU built around the SimpleScalar back-end [7]. In this approach both target ISA and toolchain are crude approximations. In [35] a Mali GPU is modeled and OpenCL kernels are simulated using GPGPU-Sim, yet its completeness and accuracy are insufficient for most use cases. Collange et al. have developed Barra, an architectural simulator for the native NVIDIA G8x and G9x instruction sets [17]. With a reimplementation of the low level CUDA runtime API, they produced a simulator capable of running CUDA applications. While the target ISA is matched, the software stack is vastly different from the vendor supplied CUDA environment. GPUOcelot [8] supports NVIDIA’s CUDA API and implements a full function simulator providing an NVIDIA virtual machine referred to as PTX, a machine model and low level virtual ISA that is claimed to be representative of ISAs for data parallel execution. The simulator can execute compiled kernels from the CUDA compiler, but the underlying machine architecture is an abstraction of the target machine and executes a form of intermediate representation with an added-on cost model. GPUTejas [15] uses GPUOcelot to capture a GPU execution trace for further parallel simulation. gem5-GPU [14] combines the gem5 and GPGPU-Sim simulators. While gem5-GPU is a configurable full-system simulator, it suffers from similar shortcomings as GPGPU-Sim. The GPU side of GPGPU-Sim does not accurately model a real GPU, and heavily relies on a simulator-specific software stack. ATTILA [9] is an execution-driven simulator targeting the academic ATTILA unified-shader GPU. While enabling research for GPU architectures and OpenGL application tuning, ATTILA does not model a real GPU and suffers from the lack of a full driver stack. GemDroid [34] integrates ATTILA with the gem5 architecture simulator, however this framework still lacks a realistic driver stack and GPU architecture. TEAPOT [32] is a trace-based GPU simulator, designed for the evaluation of mobile GPUs and has a cycle accurate GPU model for evaluating performance. TEAPOT supports OpenGL ES 1.1/2.0 and runs unmodified Android applications, but relies on the open-source Gallium3D drivers for a generic ‘softpipe’ GPU. [36] maps several guest GPUs onto the host system’s GPU using multiplexing in a virtual platform. However, in this approach GPU kernels are intercepted at the API level, whereas our simulator executes actual Mali binary instructions. A full-system CPU/GPU simulation framework sharing some features with our simulator has been presented in [33], however their GPU model is a simplified and generic approximation. A microarchitectural simulator for Intel’s integrated GPU has recently been described in [37], which relies on binary instrumentation of kernels for trace generation. While this allows for inspection of GPU code, insertion of tracing code modifies the GPU kernels and

<table>
<thead>
<tr>
<th>Simulator</th>
<th>Full System</th>
<th>Guest CPU</th>
<th>Guest GPU</th>
<th>GPU ISA</th>
<th>Toolchain</th>
<th>Prog. Model</th>
<th>Perf. Model</th>
<th>Simulation Model</th>
<th>Max. Rel. Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barra [17]</td>
<td>GPU Only</td>
<td>N/A</td>
<td>NVIDIA</td>
<td>Approx.</td>
<td>Tesla ISA</td>
<td>Emulated</td>
<td>CUDA</td>
<td>Execution-Driven</td>
<td>≤ 81.6%</td>
</tr>
<tr>
<td>GpGPU-Sim [16]</td>
<td>GPU Only</td>
<td>N/A</td>
<td>NVIDIA</td>
<td>PTX</td>
<td>GT200 SASS</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>≤ 50.0%</td>
</tr>
<tr>
<td>gem5-GPU [14]</td>
<td>Yes</td>
<td>x86</td>
<td>NVIDIA</td>
<td>PTX</td>
<td>GT200 SASS</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>≤ 22.0%</td>
</tr>
<tr>
<td>Multi2Sim [10]</td>
<td>Yes</td>
<td>x86/Arm/</td>
<td>AMD Everg./S.Isl.</td>
<td>AMD GCN1</td>
<td>SASS</td>
<td>Custom</td>
<td>OpenCL</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
</tr>
<tr>
<td>Multi2Sim Kepler [11]</td>
<td>Yes</td>
<td>x86/Arm/</td>
<td>NVIDIA Kepler</td>
<td>SASS</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>≤ 200%</td>
</tr>
<tr>
<td>ATTILA [9]</td>
<td>GPU Only</td>
<td>N/A</td>
<td>ATTILA</td>
<td>PTX</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>N/A</td>
</tr>
<tr>
<td>GPUOcelot [8]</td>
<td>GPU Only</td>
<td>N/A</td>
<td>NVIDIA</td>
<td>PTX</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>N/A</td>
</tr>
<tr>
<td>HSAemu [30]</td>
<td>Yes</td>
<td>Retargetable/ Arm-v7A</td>
<td>Generic</td>
<td>HSAIL</td>
<td>Custom</td>
<td>OpenCL</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>N/A</td>
</tr>
<tr>
<td>GPUTejas [15]</td>
<td>GPU Only</td>
<td>N/A</td>
<td>NVIDIA</td>
<td>PTX</td>
<td>GPUOcelot μ-ops</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
<td>≤ 29.7%</td>
</tr>
<tr>
<td>MacSim [31]</td>
<td>Yes</td>
<td>x86</td>
<td>NVIDIA GeForce</td>
<td>Tesla-like</td>
<td>PTX</td>
<td>GPUOcelot μ-ops</td>
<td>Custom</td>
<td>CUDA</td>
<td>Cycle-Accurate</td>
</tr>
<tr>
<td>TEAPOT [32]</td>
<td>Yes</td>
<td>Generic</td>
<td>Generic Mobile GPU</td>
<td>Emulated</td>
<td>Custom</td>
<td>OpenCL</td>
<td>Cycle-Accurate</td>
<td>Trace-Driven</td>
<td>Not Evaluated</td>
</tr>
<tr>
<td>QEMU/MARSSx86/ PTLsim [33]</td>
<td>Yes</td>
<td>x86</td>
<td>NVIDIA</td>
<td>Tesla-like</td>
<td>Generic</td>
<td>Custom</td>
<td>OpenCL</td>
<td>Cycle-Accurate</td>
<td>Not Evaluated</td>
</tr>
<tr>
<td>GemDroid [34]</td>
<td>Yes</td>
<td>x86/Arm-v7A</td>
<td>ATTILA</td>
<td>ARB</td>
<td>Custom</td>
<td>OpenCL</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>N/A</td>
</tr>
<tr>
<td>GCN3 Simulator [13]</td>
<td>Yes</td>
<td>x86</td>
<td>AMD Pro A12-8800B APU</td>
<td>GCN3</td>
<td>Vendor</td>
<td>ROCm</td>
<td>Cycle-Accurate</td>
<td>Execution-Driven</td>
<td>~42%</td>
</tr>
<tr>
<td>Our Simulator</td>
<td>Yes</td>
<td>Retargetable/ Arm-v7A/8A</td>
<td>Retargetable/ Arm Mali-G71</td>
<td>Retargetable/ Native Binary</td>
<td>Vendor</td>
<td>Any/ OpenCL</td>
<td>Instruction-Accurate</td>
<td>Execution-Driven</td>
<td>0.0%</td>
</tr>
</tbody>
</table>

1 Maximum error of a performance metric reported in the original publication.
2 Original publication does not provide an accuracy evaluation against a hardware implementation of the simulated GPU.

TABLE IV: Feature comparison of existing GPU simulators. Our simulator is the only full-system CPU/GPU mobile platform simulator capable of hosting an unmodified GPU software stack and supporting true GPU native code execution.

interferes with their execution. Similarly, SASSI [38] provides binary instrumentation for NVIDIA kernels. The work in [39] parallelizes GPGPU-Sim, but is limited by cycle-level synchronization points, unlike our functional simulator.
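
The "Max. Rel. Error" column of Table IV reports, per simulator, the largest error of a performance metric relative to a reference. The source does not spell out the formula, so the following is a minimal sketch assuming the usual definition |simulated - reference| / |reference|. The function names and all numbers are hypothetical and are not taken from any of the cited publications.

```python
def relative_error(simulated: float, reference: float) -> float:
    """Relative error of a simulated metric against a reference measurement."""
    if reference == 0:
        raise ValueError("reference measurement must be non-zero")
    return abs(simulated - reference) / abs(reference)


def max_relative_error(pairs):
    """Largest relative error over (simulated, reference) pairs."""
    return max(relative_error(sim, ref) for sim, ref in pairs)


# Hypothetical per-benchmark instruction counts: (simulated, hardware).
functional = [(1000, 1000), (2500, 2500)]
# Hypothetical per-benchmark cycle counts from a timing model: (simulated, hardware).
timing = [(1200, 1000), (2400, 2500)]

print(f"{max_relative_error(functional):.1%}")  # prints 0.0%
print(f"{max_relative_error(timing):.1%}")      # prints 20.0%
```

Under this definition, an instruction-accurate simulator whose counts match the reference exactly reports 0.0%, while a timing model is judged by its worst benchmark rather than its average.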

VII. SUMMARY & CONCLUSION

In this paper we have presented the first fully retargetable full-system simulator supporting an unmodified software stack for a commercially available, state-of-the-art mobile GPU. Its validated instruction-accurate performance model provides more accurate insights into the GPU’s operation than simulators that claim cycle accuracy for crudely approximated architectures and non-standard runtime environments. Our full-system approach will ensure a long-lasting simulator, requiring little maintenance as new toolchains are released. While we draw on several known simulation techniques, we have demonstrated the feasibility of accurate full-system CPU/GPU simulation at performance levels comparable to or better than those of existing, less accurate simulators. Our simulation approach enables us to gain insights into mobile GPU workloads, including system-level transactions between the CPU and GPU, that are inaccessible using other GPU simulation approaches. Our simulator can characterize mobile GPU applications with an accuracy unavailable using existing GPU simulators and provides a useful tool to researchers and developers alike.
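
To illustrate what an instruction-accurate, execution-driven model provides, the sketch below interprets a toy register program and records exact architectural state and per-opcode instruction counts, with no notion of cycles. The four-opcode ISA, the `run` function, and the example program are hypothetical and purely illustrative; they are not the Mali instruction set and do not reflect our simulator's implementation.

```python
from collections import Counter


def run(program, registers=None):
    """Execute a toy register program; return (registers, opcode counts)."""
    regs = dict(registers or {})
    counts = Counter()
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        counts[op] += 1                      # one count per executed instruction
        if op == "movi":                     # movi rd, imm
            regs[args[0]] = args[1]
        elif op == "add":                    # add rd, ra, rb
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "subi":                   # subi rd, ra, imm
            regs[args[0]] = regs[args[1]] - args[2]
        elif op == "bnez":                   # bnez ra, target: branch if ra != 0
            if regs[args[0]] != 0:
                pc = args[1]
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs, counts


# Hypothetical program: accumulate 3 + 2 + 1 into r0 using a loop.
program = [
    ("movi", "r0", 0),
    ("movi", "r1", 3),
    ("add", "r0", "r0", "r1"),   # index 2: loop head
    ("subi", "r1", "r1", 1),
    ("bnez", "r1", 2),
]

regs, counts = run(program)
print(regs)                    # prints {'r0': 6, 'r1': 0}
print(sum(counts.values()))    # prints 11
```

Because the interpreter executes every instruction, the resulting counts are exact for a given program and input; a timing model would have to be layered on top to estimate cycles.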

A. Future Work & Software Release

Future work will include 3D graphics support, further performance optimizations (e.g. JIT-compiled execution of GPU code), micro-architectural performance modeling, and simulation-based design space exploration of machine-learning- and computer-vision-enabled mobile GPUs.

Our simulator has been made publicly available [40] to facilitate further research and development of mobile GPU architectures based on accurate simulation tools.

REFERENCES


