Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling

Martin Koehler, Edward Abel, Alex Bogatu, Cristina Civili, Lacramioara Mazilu, Nikolaos Konstantinou, Alvaro A. A. Fernandes, John Keane, Leonid Libkin, Norman W. Paton

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process are carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. In typical big data applications, we need to ensure that all wrangling steps, including web extraction, selection, integration and cleaning, benefit from automation wherever possible. Towards this goal, in this paper we: i) introduce a notion of data context, which associates portions of a target schema with extensional data of types that are commonly available; ii) define a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling; iii) describe how data context is used to inform automation in several steps within wrangling, specifically, matching, value format transformation, data repair, and mapping generation and selection, to optimise the accuracy, consistency and relevance of the result; and iv) evaluate the approach with real estate data and financial data, showing substantial improvements in the results of automated wrangling.
Original language: English
Pages (from-to): 169-186
Number of pages: 18
Journal: IEEE Transactions on Big Data
Volume: 7
Issue number: 1
Early online date: 15 Apr 2019
Publication status: Published - 1 Mar 2021

Keywords / Materials (for Non-textual outputs)

  • data wrangling
  • data matching
  • mapping generation
  • data transformation
  • data cleaning
  • source selection
