From Segmentation to Analyses: A Probabilistic Model for Unsupervised Morphology Induction

Toms Bergmanis, Sharon Goldwater

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages. Most previous work focuses on segmenting surface forms into their constituent morphs (e.g., taking: tak +ing), but surface form segmentation does not solve the sparse data problem as the analyses of take and taking are not connected to each other. We extend the MorphoChains system (Narasimhan et al., 2015) to provide morphological analyses that can abstract over spelling differences in functionally similar morphs. These analyses are not required to use all the orthographic material of a word (stopping: stop +ing), nor are they limited to only that material (acidified: acid +ify +ed). On average across six typologically varied languages our system has a similar or better F-score on EMMA (a measure of underlying morpheme accuracy) than three strong baselines; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor (Virpioja et al., 2013), a state-of-the-art surface segmentation system.
Original language: English
Title of host publication: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 10
ISBN (Print): 978-1-945626-34-0
Publication status: E-pub ahead of print - 7 Apr 2017
Event: 15th EACL 2017 Software Demonstrations - Valencia, Spain
Duration: 3 Apr 2017 to 7 Apr 2017


Conference: 15th EACL 2017 Software Demonstrations
Abbreviated title: EACL 2017

