Abstract
Classifying scientific literature into an abstract set of topics requires leveraging various sources from the publication and external knowledge. In the BioCreative VII LitCovid track on multi-label topic annotation of COVID-19 literature, we applied state-of-the-art deep learning document classification models (BERT and variations of HAN, CNN, and LSTM), each with a different combination of metadata (title, abstract, keywords, and journal), knowledge sources, pre-trained embeddings, and data augmentation techniques. Several ensemble techniques were then used to combine the individual model outputs into synergized predictions. We showed that a class-specific average ensemble of the pre-trained and task-specific models achieved the best micro-F1 score in our experiments on the validation (90.31%) and testing (89.32%) sets, above the median (89.25%) and mean (87.78%) of all 80 valid submissions. We summarize lessons learned from our work on this task.
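As an illustration of the ensembling idea named in the abstract, the following is a minimal sketch of class-specific averaging for multi-label prediction, followed by a micro-F1 computation. It is not the authors' implementation: the function names, the `selected` mapping (which models contribute to which label), the 0.5 threshold, and the toy numbers are all assumptions made for this example. The abstract does not state how the per-class model subsets were chosen.

```python
import numpy as np

def class_specific_average(probs, selected):
    """Average per-label probabilities over a label-specific subset of models.

    probs:    array of shape (n_models, n_docs, n_labels)
    selected: hypothetical dict mapping label index -> list of model indices;
              labels missing from the dict fall back to averaging all models.
    """
    n_models, n_docs, n_labels = probs.shape
    out = np.empty((n_docs, n_labels))
    for j in range(n_labels):
        models = selected.get(j, list(range(n_models)))
        # Take label j from every model, keep the chosen models, then average.
        out[:, j] = probs[:, :, j][models].mean(axis=0)
    return out

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all (document, label) pairs."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy data (invented): 3 models, 2 documents, 2 labels.
probs = np.array([
    [[0.9, 0.2], [0.4, 0.7]],
    [[0.7, 0.6], [0.2, 0.9]],
    [[0.1, 0.2], [0.8, 0.5]],
])
# Assumed selection: label 0 uses models 0 and 1; label 1 uses models 1 and 2.
selected = {0: [0, 1], 1: [1, 2]}

ensemble = class_specific_average(probs, selected)
y_pred = (ensemble >= 0.5).astype(int)  # assumed decision threshold
y_true = np.array([[1, 0], [1, 1]])
score = micro_f1(y_true, y_pred)
```

In practice, the per-label model subsets would presumably be chosen using validation performance, so that each topic is predicted by the models that handle it best.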
Original language | English |
---|---|
Title of host publication | KnowLab at BioCreative VII Track 5 LitCovid: Ensemble of deep learning models from diverse sources for COVID-19 literature classification |
Publisher | BioCreative |
Chapter | Track 5 LitCovid track Multi-label topic classification for COVID-19 literature annotation |
Pages | 310-313 |
Number of pages | 4 |
Volume | Proceedings of the BioCreative VII Challenge Evaluation Workshop |
ISBN (Electronic) | 978-0-578-32368-8 |
Publication status | Published - 8 Nov 2021 |
Keywords / Materials (for Non-textual outputs)
- deep learning
- ensemble learning
- multi-label classification
- document classification