Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words For Predicting Medical Codes

Vithya Yogarajan, Henry Gouk, Tony Smith, Michael Mayo, Bernhard Pfahringer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Word embeddings are a useful tool for extracting knowledge from the free-form text contained in electronic health records, but it has become commonplace to train such word embeddings on data that do not accurately reflect how language is used in a healthcare context. We use prediction of medical codes as an example application to compare the accuracy of word embeddings trained on health corpora to those trained on more general collections of text. We show that both an increase in embedding dimensionality and an increase in the volume of health-related training data improve prediction accuracy. We also present a comparison to the traditional bag-of-words feature representation, demonstrating that in many cases this conceptually simple method for representing text yields higher accuracy than word embeddings.
Original language: English
Title of host publication: Intelligent Information and Database Systems
Subtitle of host publication: ACIIDS 2020
Editors: Ngoc Thanh Nguyen, Kietikul Jearanaitanakij, Ali Selamat, Bogdan Trawiński
Publisher: Springer, Cham
Pages: 97-108
Number of pages: 12
ISBN (Electronic): 978-3-030-41964-6
ISBN (Print): 978-3-030-41963-9
Publication status: Published - 4 Mar 2020
Event: 12th Asian Conference on Intelligent Information and Database Systems - Phuket, Thailand
Duration: 23 Mar 2020 to 26 Mar 2020
Conference number: 12
https://aciids.pwr.edu.pl/2020/

Publication series

Name: Lecture Notes in Computer Science
Volume: 12033
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th Asian Conference on Intelligent Information and Database Systems
Abbreviated title: ACIIDS 2020
Country/Territory: Thailand
City: Phuket
Period: 23/03/20 to 26/03/20

Keywords

  • Word embeddings
  • Binary classification
  • Machine learning for health
