Edinburgh Research Explorer

Cleaning Integrated Data: An Approach based on Conditional Constraints and Data Provenance

Project: Research

Status: Finished
Effective start/end date: 1/06/07 to 30/11/10
Total award: £588,563.00
Funding organisation: EPSRC
Funder project reference: EP/E029213/1
Period: 1/06/07 to 30/11/10

Key findings

Data quality has been perceived as the No. 1 problem for data management. Real-life data is typically dirty, and dirty data costs US enterprises alone $600 billion each year. There are five central aspects of data quality: data consistency, data accuracy, entity resolution, information completeness and data currency. The project has contributed to the study of each of these aspects, from theory to practice. The key findings are as follows:
(1) Data consistency and accuracy: to identify and fix inconsistencies, conflicts and inaccuracies in a collection of data. We have developed the following:
(a) a conditional dependency theory to specify data quality rules that detect data inconsistencies and inaccuracies; and
(b) a complete package of practical techniques to improve data consistency and accuracy, from automatically discovering data quality rules and validating the rules discovered, to detecting errors in the data and repairing the data by fixing those errors, with performance guarantees on the quality of repairs.
(2) Entity resolution: to identify tuples from unreliable data sources that refer to the same real-world entity. We have proposed a semantic approach based on matching rules for entity resolution, including a dynamic constraint theory, an inference system and scalable matching techniques.
(3) Information completeness: to find attributes and tuples missing from data. To this end, we have developed a theory of relative information completeness. It improves on the classical open world and closed world assumptions, and is more effective in specifying information completeness in real-life applications.
(4) Data currency: to identify the current values of real-world entities, and to answer queries using those current values. We have developed the first data currency model that does not depend on the availability of reliable timestamps, as well as practical techniques to deduce the true (consistent and current) values of entities.
The research outcomes were disseminated through 27 publications in major international conferences and journals (SIGMOD, PODS, ICDE, TODS, VLDB J, TKDE), two US patents, and several invited talks at conferences and seminars at various institutes in the US, Europe and Asia.
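
To make the idea of a conditional data quality rule in finding (1) more concrete, the following is a minimal illustrative sketch, not the project's implementation. It assumes a rule of the form "for tuples matching a constant pattern, the attributes on the left determine the attribute on the right". The function name `cfd_violations`, the customer schema and the sample rows are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return groups of row indices that violate a conditional dependency.

    condition: dict mapping attribute -> constant the row must match
               for the rule to apply (hypothetical pattern format)
    lhs:       attributes that should determine `rhs` under the condition
    rhs:       the dependent attribute
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The rule only applies to rows that match the constant pattern.
        if all(row.get(attr) == value for attr, value in condition.items()):
            key = tuple(row[attr] for attr in lhs)
            groups[key].append(i)

    violations = []
    for idxs in groups.values():
        # Rows that agree on lhs but disagree on rhs are inconsistent.
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.append(idxs)
    return violations


# Hypothetical data: within the UK, zip is assumed to determine street.
customers = [
    {"country": "UK", "zip": "EH8 9AB", "street": "Crichton St"},
    {"country": "UK", "zip": "EH8 9AB", "street": "Mayfield Rd"},
    {"country": "US", "zip": "07974", "street": "Mountain Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]

result = cfd_violations(customers, {"country": "UK"}, ["zip"], "street")
print(result)  # [[0, 1]]
```

In this sketch the two US rows are not flagged, because the rule is conditioned on `country == "UK"`; an unconditional dependency from zip to street would flag them as well. The sketch only detects candidate errors. Discovering rules, validating them and repairing the data, as described in finding (1)(b), are separate problems that it does not attempt.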