TY - GEN
T1 - Leveraging MPI RMA to optimise halo-swapping communications in MONC on Cray machines
AU - Brown, Nicholas
AU - Bareford, Michael
AU - Weiland, Michele
PY - 2018/5/24
Y1 - 2018/5/24
N2 - Remote Memory Access (RMA), also known as single-sided communication, provides a way to read and write directly into the memory of other processes without issuing explicit message-passing-style communication calls. Previous studies have concluded that MPI RMA can provide increased communication performance over traditional MPI Point to Point (P2P), but these are based on synthetic benchmarks rather than real-world codes. In this work, we replace the existing non-blocking P2P communication calls in the Met Office NERC Cloud model (MONC), a mature code for modelling the atmosphere, with MPI RMA. We describe our approach in detail and discuss the options taken for correctness and performance. Experiments are performed on ARCHER, a Cray XC30, and Cirrus, an SGI ICE machine. We demonstrate on ARCHER that by using RMA we can obtain a 10-20% reduction in communication time at each timestep on up to 32768 cores, which over the entirety of a run (with many timesteps) results in a significant improvement in performance compared to P2P on the Cray. However, RMA is not a silver bullet and there are challenges when integrating RMA calls into existing codes: important optimisations are necessary to achieve good performance, and library support is not universally mature, as is the case on Cirrus. In this paper we discuss, in the context of a real-world code, the lessons learned converting P2P to RMA, explore performance and scaling challenges, and contrast alternative RMA synchronisation approaches in detail.
AB - Remote Memory Access (RMA), also known as single-sided communication, provides a way to read and write directly into the memory of other processes without issuing explicit message-passing-style communication calls. Previous studies have concluded that MPI RMA can provide increased communication performance over traditional MPI Point to Point (P2P), but these are based on synthetic benchmarks rather than real-world codes. In this work, we replace the existing non-blocking P2P communication calls in the Met Office NERC Cloud model (MONC), a mature code for modelling the atmosphere, with MPI RMA. We describe our approach in detail and discuss the options taken for correctness and performance. Experiments are performed on ARCHER, a Cray XC30, and Cirrus, an SGI ICE machine. We demonstrate on ARCHER that by using RMA we can obtain a 10-20% reduction in communication time at each timestep on up to 32768 cores, which over the entirety of a run (with many timesteps) results in a significant improvement in performance compared to P2P on the Cray. However, RMA is not a silver bullet and there are challenges when integrating RMA calls into existing codes: important optimisations are necessary to achieve good performance, and library support is not universally mature, as is the case on Cirrus. In this paper we discuss, in the context of a real-world code, the lessons learned converting P2P to RMA, explore performance and scaling challenges, and contrast alternative RMA synchronisation approaches in detail.
KW - MONC
KW - MPI RMA
KW - Remote Memory Access
KW - Passive target synchronisation
KW - MPI fences
KW - MPI PSCW
KW - Halo swapping
M3 - Conference contribution
BT - CUG 2018
T2 - Cray User Group (CUG) 2018
Y2 - 20 May 2018 through 24 May 2018
ER -