Broad coverage paragraph segmentation across languages and domains

Caroline Sporleder, Mirella Lapata

Research output: Contribution to journal › Article › peer-review


This article considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications, whose output transcripts do not usually contain punctuation or paragraph indentation and are therefore difficult to read and process. Text-to-text generation applications (e.g., summarization) could also benefit from an automatic paragraph segmentation mechanism that indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model that exploits a variety of knowledge sources (including textual cues, syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single-document summarizer and show that it is useful for structuring the output of automatically generated text.
Original language: English
Pages (from-to): 1-35
Number of pages: 35
Journal: ACM Transactions on Speech and Language Processing
Issue number: 2
Publication status: Published - 2006

