TY - JOUR
T1 - Relevance, Redundancy and Complementarity Trade-off (RRCT)
T2 - A Principled, Generic, Robust Feature Selection Tool
AU - Tsanas, Thanasis
N1 - Funding Information:
A.T. is grateful to Drs. Max Little (MIT) and Patrick McSharry (University of Oxford) for early discussions on feature selection and a preliminary draft many years ago. This work was supported by the Health Data Research UK, which receives its funding from HDR UK, Ltd. (HDR-5012), funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health & Social Care (UK), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh government), Public Health Agency (Northern Ireland), British Heart Foundation, and the Wellcome Trust. The author declares no competing interests.
Publisher Copyright:
© 2022 The Author(s)
PY - 2022/5/13
Y1 - 2022/5/13
AB - We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
KW - DSML3: Development/pre-production: Data science output has been rolled out/validated across multiple domains/problems
KW - curse of dimensionality
KW - dimensionality reduction
KW - feature selection
KW - information theory
KW - principle of parsimony
KW - statistical learning
KW - variable selection
U2 - 10.1016/j.patter.2022.100471
DO - 10.1016/j.patter.2022.100471
M3 - Article
C2 - 35607618
SN - 2666-3899
VL - 3
SP - 100471
JO - Patterns
JF - Patterns
IS - 5
M1 - 100471
ER -