The Simple Parallel R INTerface (SPRINT) offers both a parallel functions library and an interface for adding parallel functions to the R language and environment for statistical computing and graphics. The aims of this project were:
- Optimise the randomForest decision tree classifier for parallel implementation on HECToR, the UK's national supercomputing service, and then make it available for general R usage on HECToR through SPRINT.
- Analyse the performance of SPRINT's rank product implementation and optimise it.
- Benchmark both randomForest and rank product on up to 512 processes.
The outcomes of the project were:
- A parallel wrapper was added around the serial randomForest algorithm, along with a tree reduction approach for combining results in parallel. For typical cases, a 40-fold speed-up can now be achieved. However, the serial randomForest code was designed for datasets with fewer variables than are common in bioscientific cases, which limits scalability to around 64 processes.
- A task-parallel method using the existing serial rank product calculation was also implemented by distributing the bootstrap samples across processes. For certain problem sizes, excellent scalability was shown on 512 or more processes.
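The two ideas above, dividing independent work items (trees or bootstrap samples) among processes and combining the partial results with a tree reduction, can be illustrated with a minimal sketch. It is not SPRINT's code: it simulates the processes inside a single Python process, and the names `split_work`, `tree_reduce` and `grow_trees` are hypothetical, introduced here for illustration only.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def split_work(n_items: int, n_procs: int) -> List[int]:
    """Divide n_items (trees or bootstrap samples) as evenly as possible."""
    base, extra = divmod(n_items, n_procs)
    return [base + (1 if rank < extra else 0) for rank in range(n_procs)]


def tree_reduce(parts: List[T], combine: Callable[[T, T], T]) -> T:
    """Combine per-process results pairwise in about log2(P) rounds."""
    if not parts:
        raise ValueError("nothing to reduce")
    parts = list(parts)
    stride = 1
    while stride < len(parts):
        # In each round, rank i absorbs the result held by rank i + stride.
        for i in range(0, len(parts) - stride, 2 * stride):
            parts[i] = combine(parts[i], parts[i + stride])
        stride *= 2
    return parts[0]


def grow_trees(rank: int, n_trees: int) -> List[str]:
    """Hypothetical stand-in for a serial randomForest call on one process."""
    return [f"tree-{rank}-{k}" for k in range(n_trees)]


# Simulate 6 processes sharing 500 trees, then merge the sub-forests.
n_procs = 6
counts = split_work(500, n_procs)
sub_forests = [grow_trees(rank, n) for rank, n in enumerate(counts)]
forest = tree_reduce(sub_forests, lambda a, b: a + b)
print(counts, len(forest))
```

In a real distributed run each entry of `sub_forests` would live on a different process and each `combine` call would involve a message between two ranks. The pairwise pattern means no single process has to receive all partial results one after another.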