Abstract
A common problem in biomedical research is to calculate the sample size required to learn a classifier using a (possibly high-dimensional) panel of biomarkers. This paper describes a simple method, based on a Gaussian approximation, for calculating the predictive performance of the learned classifier given the size of the biomarker panel, the size of the training sample, and the optimal predictive performance (expressed as C-statistic [Formula: see text]) of the biomarker panel that could be obtained if a training sample of unlimited size were available. Under the assumption that the biomarker effect sizes have the same correlation structure as the biomarkers, the required sample size does not depend upon these correlations, but only upon [Formula: see text] and upon the sparsity of the distribution of effect sizes, defined by the proportion of biomarkers that have nonzero effects. To learn a classifier that extracts 80% of the predictive information, the required case sample size varies from about 0.1 cases per variable for a panel with [Formula: see text] and a sparse distribution of effect sizes (such that 1% of biomarkers have nonzero effect sizes) to nine cases per variable for a panel with [Formula: see text] and a diffuse distribution of effect sizes.
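To make the quantities in the abstract concrete, here is a minimal Python sketch. It is not the paper's method. It assumes that the classifier's log-likelihood ratio is Gaussian with equal variance in cases and controls, in which case the C-statistic C and the expected information for discrimination Λ (in nats) are related by C = Φ(√Λ). It also assumes that "80% of the predictive information" can be read as a fraction of Λ. The function names and the panel size of 1000 biomarkers are hypothetical, chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def c_to_information(c):
    """Expected information (nats) implied by a C-statistic.

    Assumes an equal-variance Gaussian log-likelihood ratio,
    so that C = Phi(sqrt(Lambda)).
    """
    if not 0.5 <= c < 1.0:
        raise ValueError("C-statistic must lie in [0.5, 1)")
    return _STD_NORMAL.inv_cdf(c) ** 2


def information_to_c(lam):
    """Inverse of c_to_information under the same assumption."""
    if lam < 0:
        raise ValueError("information must be non-negative")
    return _STD_NORMAL.cdf(sqrt(lam))


def c_after_extracting(c_optimal, fraction):
    """C-statistic of a classifier that captures `fraction` of the
    information available in a panel whose optimal C is `c_optimal`."""
    return information_to_c(fraction * c_to_information(c_optimal))


def cases_needed(cases_per_variable, n_biomarkers):
    """Turn a cases-per-variable figure into a number of cases.

    Rounded to 9 decimals first to avoid floating-point noise,
    then rounded up to a whole number of cases.
    """
    return ceil(round(cases_per_variable * n_biomarkers, 9))


if __name__ == "__main__":
    # Hypothetical panel: 1000 biomarkers, optimal C-statistic 0.8.
    print(c_after_extracting(0.8, 0.8))
    print(cases_needed(0.1, 1000), cases_needed(9, 1000))
```

Under these assumptions, the two cases-per-variable figures quoted in the abstract would correspond to 100 and 9000 cases for the hypothetical 1000-biomarker panel.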
Original language  English 

Pages (from-to)  962280217738807 
Journal  Statistical Methods in Medical Research 
Early online date  28 Nov 2017 
Publication status  Epub ahead of print  28 Nov 2017 
Keywords
 Journal Article
Fingerprint Dive into the research topics of 'Sample size requirements for learning to classify with high-dimensional biomarker panels'. Together they form a unique fingerprint.
Profiles

Paul McKeigue
 Deanery of Molecular, Genetic and Population Health Sciences  Chair of Genetic Epidemiology and Statistical Genetics
 Usher Institute
 Centre for Population Health Sciences
Person: Academic: Research Active