Seeded hierarchical clustering for expert-crafted taxonomies

Anish Saha, Amith Ananthram, Emily Allaway, Heng Ji, Kathleen McKeown

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that uses only a small set of labeled seed examples in a computation- and data-efficient manner. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms unsupervised and supervised baselines for the SHC task on three real-world datasets.
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2022
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Place of Publication: Abu Dhabi, United Arab Emirates
Publisher: Association for Computational Linguistics
Pages: 1595-1609
Number of pages: 15
Edition: 3
ISBN (Electronic): 9781959429432
Publication status: Published - 11 Dec 2022
Event: The 2022 Conference on Empirical Methods in Natural Language Processing - Abu Dhabi National Exhibition Centre, Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 to 11 Dec 2022
Conference number: 27
https://2022.emnlp.org/

Publication series

Name: Findings of the Association for Computational Linguistics
Publisher: ACL
ISSN (Print): 0891-2017
ISSN (Electronic): 1530-9312

Conference

Conference: The 2022 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7/12/22 to 11/12/22
