TY - JOUR
T1 - Understanding the performance and reliability of NLP tools
T2 - A comparison of four NLP tools predicting stroke phenotypes in radiology reports
AU - Casey, Arlene
AU - Davidson, Emma
AU - Grover, Claire
AU - Tobin, Richard
AU - Grivas, Andreas
AU - Zhang, Huayu
AU - Schrempf, Patrick
AU - O’Neil, Alison Q.
AU - Lee, Liam
AU - Walsh, Michael
AU - Pellie, Freya
AU - Ferguson, Karen
AU - Cvoro, Vera
AU - Wu, Honghan
AU - Whalley, Heather
AU - Mair, Grant
AU - Whiteley, William
AU - Alex, Beatrice
N1 - Funding Information:
BA and AC were supported by the Turing Fellowship and Turing project (“Large-Scale and Robust Text Mining of Electronic Healthcare Records,” EP/N510129/1) from The Alan Turing Institute. HW, CG, and BA were supported by the Medical Research Council Pathfinder Award “Leveraging routinely collected & linked research data to study causes and consequences of common mental disorders” (MC_PC_17209). AC, BA, HW, and HZ are funded by Legal and General PLC as part of the Advanced Care Research Centre. ED is funded by the Alzheimer's Society. WW was supported by the CSO (CAF/17/01), the Alzheimer's Society, HDRUK, and the Stroke Association. This work is supported by the Industrial Centre for AI Research in digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) (project number: 104690). Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6) and the Scottish Funding Council (HR03006) and is currently supported by the Wellcome Trust (216767/Z/19/Z). The study “Association between brain computed tomography (CT) abnormalities and delirium: a study in 3,300 older emergency patients in NHS Fife” was funded by the Dunhill Medical Trust (R484/0516). The study was funded by the Legal and General Group as part of their corporate social responsibility programme, providing a research grant to establish the independent Advanced Care Research Centre at the University of Edinburgh. The funder had no role in the conduct of the study, interpretation, or the decision to submit for publication. The views expressed are those of the authors and not necessarily those of Legal and General.
Publisher Copyright:
2023 Casey, Davidson, Grover, Tobin, Grivas, Zhang, Schrempf, O’Neil, Lee, Walsh, Pellie, Ferguson, Cvoro, Wu, Whalley, Mair, Whiteley and Alex.
PY - 2023/9/28
Y1 - 2023/9/28
N2 - Background: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications. Methods: We used F1 score, precision, and recall to compare NLP tools on two cohorts: one from a study on delirium using images and radiology reports from NHS Fife, and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques established during the development of EdIE-R, in conjunction with an expert researcher who read the underlying images. Results: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For SVD, EdIE-R achieved F1 scores of ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest at ≥77%, with Sem-EHR at ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR scored lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80% for ischaemic stroke, outperforming EdIE-R at 66%. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR at 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%. Conclusions: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, with the variation more apparent for ischaemic stroke. NLP tools cannot be used “out of the box” in clinical settings. It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.
KW - brain radiology
KW - electronic health records
KW - natural language processing
KW - stroke phenotype
UR - http://www.scopus.com/inward/record.url?scp=85174150809&partnerID=8YFLogxK
U2 - 10.3389/fdgth.2023.1184919
DO - 10.3389/fdgth.2023.1184919
M3 - Article
AN - SCOPUS:85174150809
SN - 2673-253X
VL - 5
SP - 1
EP - 14
JO - Frontiers in Digital Health
JF - Frontiers in Digital Health
M1 - 1184919
ER -