TY - JOUR
T1 - nestedcv
T2 - an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data
AU - Lewis, Myles J
AU - Spiliopoulou, Athina
AU - Goldmann, Katriona
AU - Pitzalis, Costantino
AU - McKeigue, Paul
AU - Barnes, Michael R
N1 - Funding Information:
This work has been supported by NIHR (grant 131575) and MRC (MR/V012509/1).
Publisher Copyright:
© 2023 The Author(s). Published by Oxford University Press.
PY - 2023/4/13
Y1 - 2023/4/13
N2 - Motivation: Although machine learning models are commonly used in medical research, many analyses implement a simple partition into training data and hold-out test data, with cross-validation (CV) for tuning of model hyperparameters. Nested CV with embedded feature selection is especially suited to biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P ≫ n). Results: The nestedcv R package implements fully nested k × l-fold CV for lasso and elastic-net regularized linear models via the glmnet package and supports a large array of other machine learning models via the caret framework. Inner CV is used to tune models and outer CV is used to determine model performance without bias. Fast filter functions for feature selection are provided and the package ensures that filters are nested within the outer CV loop to avoid information leakage from performance test sets. Measurement of performance by outer CV is also used to implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model and determine unbiased model accuracy.
U2 - 10.1093/bioadv/vbad048
DO - 10.1093/bioadv/vbad048
M3 - Article
C2 - 37113250
SN - 2635-0041
VL - 3
SP - vbad048
JO - Bioinformatics Advances
JF - Bioinformatics Advances
IS - 1
M1 - vbad048
ER -