It is widely accepted that for codes to take advantage of exa-scale systems, programmers will need to exploit multiple levels of parallelism. Inevitably this will involve combining different programming technologies in a single code and having them interoperate. An example of this is task-based models, which work well in shared memory, but scaling them out to distributed-memory machines requires combining them with other technologies such as MPI. We have developed a directory/cache which can be integrated with higher-level programming language runtimes and enables interoperability, transparent to the end user, between shared-memory and distributed-memory technologies such as task-based models and MPI. The directory/cache presents a single, unified global view of memory to users, abstracting them from the underlying complexity of how memory is physically distributed across nodes and from issues such as the uneven decomposition of data. Internally, MPI RMA is used as the transport layer; its ubiquity and predictable performance made it a natural choice. We illustrate how this RMA transport layer has been implemented and describe several features of the MPI 3 standard that we rely upon to support transparent data movement. Abstraction can come at a cost to performance; to quantify this, we have developed a benchmark (block Cholesky matrix factorisation) and present performance and scaling results in comparison to direct MPI RMA implementations using passive and active target synchronisation. This benchmark is also used to briefly illustrate how the directory/cache might be used in code.
|Number of pages||2|
|Publication status||Published - Jun 2017|
|Event||International Supercomputing Conference - Frankfurt, Germany|
Duration: 26 Jun 2017 → 29 Jun 2017
- Parallel programming