Abstract / Description of output
This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows text mining queries to be run in parallel, via Apache Spark, across large datasets such as historical newspapers and books. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets on two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.
Original language | English |
---|---|
Title of host publication | 2019 IEEE 15th International Conference on e-Science (e-Science) |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 235-242 |
Number of pages | 8 |
ISBN (Electronic) | 978-1-7281-2451-3 |
ISBN (Print) | 978-1-7281-2452-0 |
Publication status | Published - 19 Mar 2020 |
Event | 2019 IEEE 15th International Conference on e-Science (e-Science), San Diego, United States. Duration: 24 Sept 2019 → 27 Sept 2019. https://escience2019.sdsc.edu/ |
Conference
Conference | 2019 IEEE 15th International Conference on e-Science (e-Science) |
---|---|
Abbreviated title | e-Science 2019 |
Country/Territory | United States |
City | San Diego |
Period | 24/09/19 → 27/09/19 |
Internet address | https://escience2019.sdsc.edu/ |
Keywords / Materials (for Non-textual outputs)
- text mining
- distributed queries
- Apache Spark
- High-Performance Computing
- XML schemas
- digital tools
- digitised primary historical sources
- humanities research