Edinburgh Research Explorer

defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access permissions: Open

Original language: English
Title of host publication: 2019 IEEE 15th International Conference on e-Science (e-Science)
Number of pages: 8
Publication status: Accepted/In press - 10 Jul 2019
Event: 2019 IEEE 15th International Conference on e-Science (e-Science) - San Diego, United States
Duration: 24 Sep 2019 to 27 Sep 2019
https://escience2019.sdsc.edu/

Conference

Conference: 2019 IEEE 15th International Conference on e-Science (e-Science)
Abbreviated title: e-Science 2019
Country: United States
City: San Diego
Period: 24/09/19 to 27/09/19

Abstract

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It runs text mining queries in parallel, via Apache Spark, across large datasets such as historical newspapers and books. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

Research areas

  • text mining, distributed queries, Apache Spark, High-Performance Computing, XML schemas, digital tools, digitised primary historical sources, humanities research

ID: 102404643