1 Million Captioned Dutch Newspaper Images

Desmond Elliott, Martijn Kleppe

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution


Images naturally appear alongside text in a wide variety of media, such as books, magazines, newspapers, and online articles. This type of multi-modal data offers an interesting basis for vision and language research, but most existing datasets use crowdsourced text, which removes the images from their original context. In this paper, we introduce the KBK-1M dataset of 1.6 million images in their original context, with co-occurring texts found in Dutch newspapers from 1922 to 1994. The images are digitally scanned photographs, cartoons, sketches, and weather forecasts; the text is generated from OCR-scanned blocks. The dataset is suitable for experiments in automatic image captioning, image-article matching, object recognition, and data-to-text generation for weather forecasting. It can also be used by humanities scholars to analyse photographic style changes and the representation of people and societal issues, and to develop new tools for exploring photograph reuse via image-similarity-based search.
Original language: English
Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portoroz, Slovenia, May 23-28, 2016.
Publisher: European Language Resources Association (ELRA)
Number of pages: 5
ISBN (Print): 978-2-9517408-9-1
Publication status: Published - May 2016
Event: 10th edition of the Language Resources and Evaluation Conference - Portorož, Slovenia
Duration: 23 May 2016 to 28 May 2016


Conference: 10th edition of the Language Resources and Evaluation Conference
Abbreviated title: LREC 2016

