A general purpose synchronization mechanism for a parallel computer should allow arbitrary, data-dependent, dynamically partitioned groups of processors to remain internally synchronized, while proceeding asynchronously with respect to other groups. We present an algorithm which can support such a scheme. The algorithm constructs binary synchronization trees for the sub-groups, given a group of processors and a [0,1] label for each processor, and is valid for any network. We provide a general complexity analysis in terms of operations on the synchronization trees which is then instantiated with respect to the n x n processor 2D mesh architecture. We show that the algorithm constructs a synchronization tree for any sub-group of s processors in O(n log s) parallel communication steps with high probability. We present lower bounds on achievable performance based on the mesh indexing scheme used: row/column major indexing schemes require [omega](n log n) parallel communication steps in the worst case, whereas the recursive Hilbert indexing scheme requires [omega](n[square root of log n]) parallel communication steps. Experimental results are given validating the analysis. Our algorithm has applications in implementations of PRAMs (e.g. conditional instructions) and of nested data parallelism (or mixed data/task parallelism) on distributed processor networks.
|Publisher||Edinburgh : University of Edinburgh, Dept. of Computer Science, Computer Systems Group, |
|Publication status||Published - 1996|