TY - JOUR
T1 - Scalable analysis of multi-modal biomedical data
AU - Smith, Jaclyn
AU - Shi, Yao
AU - Benedikt, Michael
AU - Nikolic, Milos
N1 - Publisher Copyright: © 2021 The Author(s) 2021. Published by Oxford University Press GigaScience.
PY - 2021/9/11
Y1 - 2021/9/11
N2 - Background: Targeted diagnosis and treatment options are dependent on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes. Solution: To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. Performance: We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
AB - Background: Targeted diagnosis and treatment options are dependent on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes. Solution: To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. Performance: We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
KW - distributed processing
KW - multi-modal data integration
KW - multi-omics analysis
KW - nested data
KW - query compilation
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85116463014&partnerID=8YFLogxK
U2 - 10.1093/gigascience/giab058
DO - 10.1093/gigascience/giab058
M3 - Article
C2 - 34508579
AN - SCOPUS:85116463014
SN - 2047-217X
VL - 10
JO - GigaScience
JF - GigaScience
IS - 9
M1 - giab058
ER -