Generating Annotated Corpora for Reading Comprehension and Question Answering Evaluation

Tiphaine Dalmas, Jochen L. Leidner, Bonnie Webber, Claire Grover, Johan Bos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Recently, reading comprehension tests for students and adult language learners have received increased attention within the NLP community as a means to develop and evaluate robust natural language question answering (NLQA) methods. We present our ongoing work on automatically creating richly annotated corpus resources for NLQA and on comparing automatic methods for answering questions against this data set. Starting with the CBC4Kids corpus, we have added XML annotation layers for tokenization, lemmatization, stemming, semantic classes, POS tags and best-ranking syntactic parses to support future experiments with semantic answer retrieval and inference. Using this resource, we have calculated a baseline for word-overlap based answer retrieval (Hirschman et al., 1999) on the CBC4Kids data and found that the method performs slightly better than on the REMEDIA corpus. We hope that our richly annotated version of the CBC4Kids corpus will become a standard resource, especially as a controlled environment for evaluating inference-based techniques.
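The word-overlap baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration in the spirit of Hirschman et al. (1999), not the authors' exact implementation: each sentence of the story is scored by the number of content words it shares with the question, and the highest-scoring sentence is returned as the answer. The tokenizer and stoplist here are illustrative assumptions.

```python
import re

# Small illustrative stoplist; the original experiments may have used a
# different one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "was", "what",
             "who", "when", "where", "why", "how", "did", "do", "does"}

def content_words(text):
    """Lowercased alphabetic tokens minus the stoplist."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def answer_by_overlap(question, sentences):
    """Return the sentence with the largest content-word overlap with the question."""
    q_words = content_words(question)
    return max(sentences, key=lambda s: len(q_words & content_words(s)))

# Hypothetical toy story, not taken from CBC4Kids.
story = [
    "The school fair opened on Saturday morning.",
    "Students sold lemonade to raise money for books.",
    "The principal thanked everyone at noon.",
]
print(answer_by_overlap("Why did the students sell lemonade?", story))
# → Students sold lemonade to raise money for books.
```

A sentence-level overlap score like this gives the kind of baseline against which inference-based techniques can then be compared on the annotated corpus.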
Original language: English
Title of host publication: Proc. of EACL, Question Answering Workshop
Publication status: Published - 2003


