Abstract
Background
Deriving structured data from unstructured clinical notes in electronic health records (EHRs) requires natural language processing and clinical expertise, which is often costly and frequently a one-off investment. We implemented SemEHR, a semantic search system that reduces the expertise and effort required in this context. We aimed to use it to characterise and select patients for projects such as the UK Department of Health 100,000 Genomes Project.
Methods
Built upon the off-the-shelf toolkits Bio-YODIE and CogStack, SemEHR integrates heterogeneous EHR documents and identifies contextualised mentions (annotated for negation, temporality, and experiencer) of a wide range of biomedical concepts drawn from terminologies including SNOMED CT, ICD-10, LOINC, and Drug Ontology. Text mining and semantic techniques are used to derive a longitudinal patient panorama that combines structured profiles with unstructured records and is available through semantic search interfaces.
Findings
We deployed SemEHR in various UK hospital EHRs, including the South London and Maudsley NHS Foundation Trust, where 46 million concept mentions were identified from 18 million documents. In a liver disease study, SemEHR identified 94 of 100 manually annotated hepatitis C-positive patients. In an HIV study, SemEHR identified 21 of 23 true positives in a 1000-patient cohort. At King's College Hospital, SemEHR is being used to recruit patients into the 100,000 Genomes Project, where ontological associations are integrated to match recruitment criteria and populate complex phenotype models. A preliminary evaluation suggests that the tool can validate previously submitted cases and searches phenotypes quickly.
Interpretation
Using SemEHR, a query such as “find patients with a family history of hepatitis C”, which previously might have required the user to have natural language processing expertise, becomes a simple search: SemEHR retrieves a relevant patient cohort, populates patient-level summaries, and provides a link to each mention in the original source. Results and feedback from multiple studies have demonstrated its efficiency: work that previously took weeks or months can, in some cases, be done within minutes.
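As a rough illustration of how such a query can reduce to a filter over contextualised mentions, the following minimal sketch assumes each mention already carries negation, temporality, and experiencer attributes. The `Mention` record, the `search` function, and the sample data are hypothetical and are not SemEHR's actual data model or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical contextualised concept mention (not SemEHR's schema)."""
    patient_id: str
    concept: str       # illustrative label, not a real terminology code
    negated: bool      # True if the text denies the concept
    temporality: str   # e.g. "recent" or "historical"
    experiencer: str   # "patient", or "other" for e.g. a family member
    doc_id: str        # pointer back to the source document


def search(mentions, concept, experiencer="patient", negated=False):
    """Group matching mentions by patient, keeping source document ids."""
    cohort = {}
    for m in mentions:
        if (m.concept == concept
                and m.experiencer == experiencer
                and m.negated == negated):
            cohort.setdefault(m.patient_id, []).append(m.doc_id)
    return cohort


# Invented sample data for illustration only.
mentions = [
    Mention("p1", "hepatitis C", False, "historical", "other", "doc-1"),
    Mention("p2", "hepatitis C", False, "recent", "patient", "doc-2"),
    Mention("p3", "hepatitis C", True, "recent", "other", "doc-3"),
]

# "Family history of hepatitis C": affirmed mentions experienced by someone
# other than the patient.
family_history = search(mentions, "hepatitis C", experiencer="other")
```

In this sketch the family-history query differs from a direct diagnosis query only in the `experiencer` argument, and the retained document ids stand in for the links to each mention in the original source.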
Original language | English |
---|---|
Pages (from-to) | S97 |
Journal | The Lancet |
Volume | 390 |
DOIs | |
Publication status | Published - 1 Nov 2017 |