The significance of molecular heterogeneity in breast cancer batch correction and dataset integration

Nicholas Moir*, Dominic A. Pearce, Simon P. Langdon, T. Ian Simpson

*Corresponding author for this work

Research output: Contribution to journal › Article

Abstract

Breast cancer research benefits from a substantial collection of gene expression datasets that are commonly integrated to increase analytical power. Batch effects, the technical signal differences between experimental batches that confound true biological variation, must be addressed when integrating datasets, and several approaches exist to do so. This brief communication demonstrates that popular batch correction techniques can significantly distort key biomarker expression signals. By applying ComBat batch correction and evaluating the integrated expression values, we profile the extent of these distortions and consider an additional mitigating batch correction step. We show that leveraging a priori knowledge of sample molecular subtype classification can optimally remove batch effect distortion while preserving key biomarker expression variation and transcriptional legitimacy. To the best of our knowledge, this study presents the first analysis of the interplay between dataset molecular composition and the concomitant robustness of integrated, batch-corrected biological expression signal.
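The effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' code and not ComBat itself: it is a simplified, location-only batch adjustment for a single gene, without ComBat's scale adjustment or empirical Bayes shrinkage. The function name `center_batches`, the subtype labels, and the expression values are hypothetical toy choices. The sketch assumes two batches with different subtype composition and a constant offset of 2 added to the second batch.

```python
import numpy as np

def center_batches(expr, batch, subtype=None):
    """Location-only batch adjustment for one gene (a simplified stand-in, not ComBat).

    subtype=None: each batch is shifted so that its mean equals the grand mean.
    subtype given: batch offsets are estimated by least squares jointly with
    subtype effects, so only the batch component is removed.
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    batches = np.unique(batch)
    # One indicator column per batch
    B = (batch[:, None] == batches[None, :]).astype(float)
    X = B
    if subtype is not None:
        subtype = np.asarray(subtype)
        subtypes = np.unique(subtype)
        # Subtype indicators; the first subtype is dropped as the reference level
        S = (subtype[:, None] == subtypes[None, :]).astype(float)[:, 1:]
        X = np.hstack([B, S])
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    # Per-sample batch offset, re-centred on the average offset
    offsets = B @ coef[: len(batches)]
    return expr - offsets + offsets.mean()

# Toy biomarker: assumed true level 10 in "ER+" samples and 4 in "ER-" samples.
# Batch "b" carries an assumed technical offset of +2 and is mostly "ER-".
batch = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
subtype = np.array(["ER+", "ER+", "ER+", "ER-", "ER+", "ER-", "ER-", "ER-"])
expr = np.array([10.0, 10.0, 10.0, 4.0, 12.0, 6.0, 6.0, 6.0])

naive = center_batches(expr, batch)           # ignores subtype composition
aware = center_batches(expr, batch, subtype)  # uses subtype as a covariate

for label, values in [("naive", naive), ("subtype-aware", aware)]:
    pos = values[subtype == "ER+"]
    print(label, "ER+ values across batches:", np.round(pos, 2))
```

In this toy setting the subtype-blind adjustment treats the compositional difference between batches as if it were technical, so same-subtype samples end up further apart across batches than before correction. Supplying the subtype labels lets the model separate the two sources of variation.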
Original language: English
Journal: medRxiv
Publication status: Published - 26 Dec 2024
