Horses to Zebras: Ontology-Guided Data Augmentation and Synthesis for ICD-9 Coding

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Medical document coding is the process of assigning labels from a structured label space (an ontology, e.g., ICD-9) to medical documents. This process is laborious, costly, and error-prone. In recent years, efforts have been made to automate it with neural models. The label spaces are large (on the order of thousands of labels) and follow a big-head, long-tail label distribution, giving rise to few-shot and zero-shot scenarios. Previous efforts tried to address these scenarios within the model, leading to improvements on rare labels but worse results on frequent ones. We propose data augmentation and synthesis techniques to address these scenarios. We further introduce an analysis technique for this setting, inspired by confusion matrices. This analysis technique points to the positive impact of data augmentation and synthesis, but also highlights more general issues of confusion within families of codes and of underprediction.
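To make the few-shot and zero-shot terminology concrete, the following is a minimal sketch of how labels in a long-tailed label space might be grouped by their training frequency. It is not taken from the paper: the function name `bucket_labels`, the `few_shot_max` threshold, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def bucket_labels(train_label_sets, label_space, few_shot_max=5):
    """Group labels by how often they occur in the training documents.

    few_shot_max is a hypothetical cut-off chosen for illustration only.
    """
    # Count, for each code, the number of training documents carrying it.
    counts = Counter(code for labels in train_label_sets for code in labels)
    buckets = {"frequent": [], "few_shot": [], "zero_shot": []}
    for code in sorted(label_space):
        n = counts[code]
        if n == 0:
            buckets["zero_shot"].append(code)   # never seen in training
        elif n <= few_shot_max:
            buckets["few_shot"].append(code)    # seen only a handful of times
        else:
            buckets["frequent"].append(code)    # part of the "big head"
    return buckets


# Toy example: three documents, four codes in the label space.
toy_train = [{"401.9", "428.0"}, {"401.9"}, {"401.9", "250.00"}]
toy_space = {"401.9", "428.0", "250.00", "277.30"}
toy_buckets = bucket_labels(toy_train, toy_space, few_shot_max=1)
print(toy_buckets)
```

In this toy setting, one code dominates, two appear once each, and one never appears in training, which mirrors on a small scale the distribution the abstract describes.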
Original language: English
Title of host publication: Proceedings of the 21st Workshop on Biomedical Language Processing
Editors: Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Place of Publication: Dublin, Ireland
Publisher: Association for Computational Linguistics
Pages: 389-401
Number of pages: 13
ISBN (Electronic): 978-1-955917-27-8
Publication status: Published - 3 Jun 2022
Event: The 21st Workshop on Biomedical Language Processing - Dublin, Ireland
Duration: 26 May 2022 to 26 May 2022
Conference number: 21

Workshop

Workshop: The 21st Workshop on Biomedical Language Processing
Abbreviated title: BIONLP 2022
Country/Territory: Ireland
City: Dublin
Period: 26/05/22 to 26/05/22