Using SPRINT and parallelised functions for analysis of large data on multi-core Mac and HPC platforms
Abstract
Keywords: HPC, Big Data, Genomics, SPRINT, Parallelisation
We present the improvements in computational performance (CPU time, memory requirements) that can be obtained in the analysis of large biological (or other) data sets through use of the SPRINT package (www.r-sprint.org).
With the arrival of “big data” (microarrays, screens, next-generation sequencing) in the life sciences, standard analyses of these data now confront regular R users with severe computation-time or memory limitations. Many projects (including parallelisation efforts within the R core) offer R packages and functions for programming solutions to large-scale analysis problems. However, these usually require familiarity with HPC programming as well as sufficient, funded time to apply them, which is feasible for one-off analysis problems but impractical for common analysis methods.
To make High Performance Computing (HPC) solutions available to R users without HPC experience, we started development of the SPRINT package in 2008. It gives these users straightforward access to already implemented parallelised versions of many relevant R functions, on multi-core Macs as well as on large-scale clusters/HPC platforms such as the UK’s HECToR and ARCHER (we have also tested on the Amazon Elastic Compute Cloud). In addition to speed-critical problems, we also address memory-critical problems.
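To make this workflow concrete, here is a minimal sketch of what a SPRINT analysis script might look like, using pcor (one of the parallelised functions listed further below). The input file name is hypothetical, and the pterminate() shutdown call reflects our assumption about how a SPRINT session is ended; details should be checked against the package documentation.

```r
# my_analysis.R -- minimal SPRINT workflow sketch (assumes a working MPI set-up)
library("sprint")                                  # load SPRINT; initialises the MPI environment

# Hypothetical input: a numeric matrix read from a tab-delimited file
expr <- as.matrix(read.table("expression_matrix.txt"))

# Parallelised correlation; assumed to mirror base R's cor(), distributed over MPI processes
corr <- pcor(expr)

save(corr, file = "correlation.RData")
pterminate()                                       # assumed SPRINT call that shuts down the MPI workers
quit(save = "no")
```

Such a script would then be run under MPI rather than interactively, for example with something like `mpiexec -n 8 R -f my_analysis.R` on a multi-core Mac, or through the batch system of a cluster such as HECToR or ARCHER (the exact launch command depends on the local MPI installation).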
Here we introduce recent upgrades to SPRINT, show regular R users how to use the package, and explain to users with an HPC background how our parallelisation strategies are aimed particularly at problems that go beyond ‘simple’ task farming. We outline case examples of SPRINT in use, as well as the performance and limitations of our approach in the context of biological high-throughput data (although most individual functions are generically usable for other large data sets).
Based on our own needs and those established in R user surveys, we currently provide parallelised versions [1] of original [2] functions (our function names add the prefix ‘p’, apart from pmaxt, which is based on mt.maxT) that are essential for clustering, classification and non-parametric statistics applied to very large data sets: pstringdistmatrix, pboot, papply, pcor, ppam, prandomForest, pmaxt, pRP, psvm.
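As the prefix-‘p’ naming convention suggests, each SPRINT function is intended as a near drop-in replacement for its serial original. The sketch below illustrates this with pstringdistmatrix, assuming (but not guaranteeing) that it accepts the same arguments as stringdist::stringdistmatrix; a SPRINT session as in the earlier sketch is required.

```r
# Drop-in sketch: serial stringdist::stringdistmatrix() vs. SPRINT's pstringdistmatrix().
# Assumes pstringdistmatrix accepts the same arguments as its serial counterpart.
library("stringdist")
library("sprint")

seqs <- c("ACGTACGT", "ACGTTCGT", "TTGTACGA")               # toy sequence fragments

d_serial   <- stringdistmatrix(seqs, seqs, method = "lv")   # single core, Levenshtein distance
d_parallel <- pstringdistmatrix(seqs, seqs, method = "lv")  # distributed over MPI processes

pterminate()                                                # assumed SPRINT shutdown call
```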
References
[1] Publications describing our function implementations can be found at www.r-sprint.org → Publications
[2] Source citations for the original packages can be found at www.r-sprint.org → Overview and R functions
Original language | English |
---|---|
Title of host publication | The R User Conference 2014 |
Place of Publication | Los Angeles |
Pages | 1 |
Number of pages | 1 |
Publication status | Published - 1 Jul 2014 |
Event | The R User Conference 2014 - UCLA, Los Angeles, United States; Duration: 30 Jun 2014 → 3 Jul 2014 |
Conference
Conference | The R User Conference 2014 |
---|---|
Country/Territory | United States |
City | Los Angeles |
Period | 30/06/14 → 3/07/14 |
Projects
- The SPRINT approach to network biology (Finished)
  Ghazal, P. (Principal Investigator), Sloan, T. (Co-investigator), Cebamanos, L. (Researcher), Forster, T. (Researcher), Mitchell, L. (Researcher), Robertson, K. (Researcher) & Troup, E. (Researcher)
  1/10/12 → 30/09/14
  Project: Research