Robust cross-lingual genre classification through comparable corpora

Philipp Petrenz, Bonnie Webber

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections of texts from the same set of genres but written in different languages – are exploited to train classification models on multi-lingual text collections. The resulting genre classifiers are shown to be robust and high-performing when compared to classifiers trained on mono-lingual training sets. The work also shows that comparable corpora can be used to identify features that are indicative of genre in various languages. These features can be considered stable genre predictors across a set of languages. Our experiments show that selecting stable features yields significant accuracy gains over the full feature set, and that a small number of features can suffice to reliably distinguish between different genres.
Original language: English
Title of host publication: The 5th Workshop on Building and Using Comparable Corpora
Subtitle of host publication: Special Theme: “Language Resources for Machine Translation in Less-Resourced Languages and Domains”
Pages: 1-10
Number of pages: 9
Publication status: Published - May 2012
