Abstract
The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive.
Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, standard runahead execution skips past cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses so as to generate the loads in each dependent chain, it could recover this lost performance, provided it could stall on many chains at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, to achieve high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
Original language | English |
---|---|
Title of host publication | 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) |
Publisher | IEEE |
Number of pages | 14 |
ISBN (Electronic) | 978-1-6654-3333-4 |
ISBN (Print) | 978-1-6654-3334-1 |
DOIs | |
Publication status | Published - 4 Aug 2021 |
Event | 48th International Symposium on Computer Architecture, Online. Duration: 14 Jun 2021 → 19 Jun 2021. https://iscaconf.org/isca2021/ |
Publication series
Name | |
---|---|
ISSN (Print) | 1063-6897 |
ISSN (Electronic) | 2575-713X |
Symposium
Symposium | 48th International Symposium on Computer Architecture |
---|---|
Abbreviated title | ISCA 2021 |
Period | 14/06/21 → 19/06/21 |
Internet address | https://iscaconf.org/isca2021/ |