TY - JOUR
T1 - Ontology-driven and weakly supervised rare disease identification from clinical notes
AU - Dong, Hang
AU - Suárez-Paniagua, Víctor
AU - Zhang, Huayu
AU - Wang, Minhong
AU - Casey, Arlene
AU - Davidson, Emma
AU - Chen, Jiaoyan
AU - Alex, Beatrice
AU - Whiteley, William
AU - Wu, Honghan
N1 - Funding Information:
This work is supported by Health Data Research UK National Phenomics and Text Analytics Implementation Projects, Wellcome Institutional Translation Partnership Awards (PIII009, PIII029, PIII032, PIII054), Medical Research Council and Health Data Research UK (MR/S004149/1). HZ and AC are supported by the Advanced Care Research Centre (ACRC). HD and JC are also supported by EPSRC project ConCur on Knowledge Graph Construction and Curation (EP/V050869/1). The funding bodies are independent of the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Publisher Copyright:
© 2023, The Author(s).
PY - 2023/5/5
Y1 - 2023/5/5
N2 - BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to be identified due to few cases available for machine learning and the need for data annotation from domain experts.METHODS: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations.RESULTS: The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes).CONCLUSION: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weak supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
AB - BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to be identified due to few cases available for machine learning and the need for data annotation from domain experts.METHODS: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations.RESULTS: The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes).CONCLUSION: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weak supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
U2 - 10.1186/s12911-023-02181-9
DO - 10.1186/s12911-023-02181-9
M3 - Article
C2 - 37147628
SN - 1472-6947
VL - 23
JO - Bmc medical informatics and decision making
JF - Bmc medical informatics and decision making
IS - 1
ER -