Modern branch predictors are often too large and power-hungry to be a viable option for small embedded processors, where die space, power consumption and performance are all at a premium. In embedded processors, the large cache structures required for high-performance branch prediction can easily take up more die space than the rest of the processor combined. When coupled with large leakage energies, which are set to become an increasing issue as technologies advance to 45nm and beyond, it can often appear appealing not to use a dynamic branch predictor at all. This paper seeks a way of using an ultra small branch predictor in a hybrid predictor configuration suitable for an embedded processor. We introduce a novel bias parameter into the decision of when to predict branches statically or dynamically, further exploring the performance vs energy trade-off. We present a solution that reduces dynamic branch predictor aliasing, improves performance and requires a minimum of extra die space. The results presented relate die space requirements, energy use and performance impacts. We look at how best to optimise this balance in a way that is usually not considered, and on a smaller bit budget than has previously been presented. The EEMBC 1.1 benchmark suite was used to explore the energy vs performance trade-off boundary, taking averages of the results across 31 different benchmarks. We evaluate 5 traditional branch predictor configurations and 36 novel ultra small hybrid branch predictors through the use of 9 sets of our novel bias values, combining GShare dynamic predictions with profiled backwards taken forwards not-taken (BTFN) / backwards not-taken forwards taken (BNFT) static predictions. The results demonstrate that the use of a static-dynamic hybrid is not only beneficial but necessary for very small predictors to produce a positive effect on the cycle count and overall energy use of the processor.
Through the use of our novel bias parameter we explore the performance vs energy trade-off and show that, through a small reduction in peak performance (0.1 seconds at 500 MHz, or 0.35%, of a total runtime in the region of 28.35 seconds) for a given architecture, we can gain substantial dynamic energy savings from reduced dynamic predictor accesses (removing up to an additional 16.5%, or 53 million, of the traditional hybrid predictor accesses). Our best performing architecture showed an average improvement in run time of 2 seconds (6.7%) over a static BTFN baseline (total runtime 30.46 seconds), at the cost of only an additional 0.01 mm² (or 1%) of die space.
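The static-dynamic split described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the bias parameter, so the sketch assumes it acts as a profiled-accuracy threshold above which a branch is predicted statically and never touches the dynamic table. The names `select_static_branches`, `HybridPredictor`, the 6-bit table size and the 2-bit counters are all hypothetical choices made for illustration.

```python
def static_predict(pc: int, target: int, scheme: str) -> bool:
    """Static direction rule: BTFN predicts backward branches taken,
    BNFT predicts the opposite."""
    backward = target < pc
    return backward if scheme == "BTFN" else not backward


def select_static_branches(profile: dict, bias: float) -> dict:
    """Assumed role of the bias: a branch is made static when the better of
    BTFN/BNFT reaches at least `bias` accuracy in the profile.
    profile maps pc -> (target, times_taken, times_executed)."""
    chosen = {}
    for pc, (target, taken, total) in profile.items():
        taken_rate = taken / total
        btfn_acc = taken_rate if target < pc else 1 - taken_rate
        scheme, acc = ("BTFN", btfn_acc) if btfn_acc >= 0.5 else ("BNFT", 1 - btfn_acc)
        if acc >= bias:
            chosen[pc] = scheme
    return chosen


class GShare:
    """Textbook GShare: PC XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits: int = 6):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


class HybridPredictor:
    """Static branches bypass the GShare table entirely, so they neither
    cost a dynamic access nor alias with dynamically predicted branches."""

    def __init__(self, static_branches: dict, index_bits: int = 6):
        self.static_branches = static_branches
        self.dynamic = GShare(index_bits)
        self.dynamic_accesses = 0

    def predict(self, pc: int, target: int) -> bool:
        scheme = self.static_branches.get(pc)
        if scheme is not None:
            return static_predict(pc, target, scheme)
        self.dynamic_accesses += 1
        return self.dynamic.predict(pc)

    def update(self, pc: int, taken: bool) -> None:
        if pc not in self.static_branches:
            self.dynamic.update(pc, taken)
```

Under this assumed interpretation, lowering the bias moves more branches to static prediction, which trades some prediction accuracy for fewer dynamic predictor accesses.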
|Title of host publication||Architecture of Computing Systems – ARCS 2012|
|Subtitle of host publication||25th International Conference, Munich, Germany, February 28 - March 2, 2012. Proceedings|
|Publication status||Published - 2012|
|Name||Lecture Notes in Computer Science|