Syntactic Chunking Across Different Corpora

Weiqun Xu, Jean Carletta, Johanna Moore

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Syntactic chunking has been a well-defined and well-studied task since its introduction as the CoNLL-2000 shared task. Although further effort has been spent on improving chunking performance, the experimental data has, with few exceptions, been restricted to (part of) the Wall Street Journal data adopted in the shared task. It remains an open question how well these successful chunking techniques extend to other data, which may differ in genre/domain and/or amount of annotation. In this paper we first train chunkers with three classifiers on three different data sets and test on four data sets. We also vary the size of the training data systematically to show the data requirements of chunkers. It turns out that there is no significant difference between the state-of-the-art classifiers; that training on plentiful data from the same corpus (Switchboard) yields results comparable to Wall Street Journal chunkers even when the underlying material is spoken; and that the performance achieved with a large amount of unmatched training data can also be obtained with a very modest amount of matched training data.
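For readers unfamiliar with the task: in the CoNLL-2000 formulation, a chunker labels each token with a BIO tag ('B-XX' begins a chunk of type XX, 'I-XX' continues it, 'O' is outside any chunk), and non-recursive phrases are read off from the tag sequence. The following is a minimal illustrative sketch of that decoding step, not the classifiers or data used in the paper:

```python
# Minimal sketch of CoNLL-2000-style BIO chunk decoding.
# Hypothetical helper for illustration; not the paper's system.

def bio_to_chunks(tokens, tags):
    """Group (token, BIO tag) pairs into labelled, non-recursive chunks."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])  # start a new chunk of this type
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open chunk
        else:                             # 'O' or inconsistent I- tag closes it
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
tags   = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tokens, tags))
# [('NP', 'He'), ('VP', 'reckons'), ('NP', 'the current account deficit')]
```

The classifiers compared in the paper predict the tag sequence; the decoding into phrases is deterministic, as above.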
Original language: English
Title of host publication: Machine Learning for Multimodal Interaction
Subtitle of host publication: Third International Workshop, MLMI 2006, Bethesda, MD, USA, May 1-4, 2006, Revised Selected Papers
Editors: Steve Renals, Samy Bengio, Jonathan G. Fiscus
Publisher: Springer Berlin Heidelberg
Number of pages: 12
ISBN (Electronic): 978-3-540-69268-3
ISBN (Print): 978-3-540-69267-6
Publication status: Published - 2006

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer Berlin Heidelberg
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


