Data Cleaning using Probabilistic Models of Integrity Constraints

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In data cleaning, data quality rules provide a valuable tool for enforcing correct semantics on a dataset. Traditional rule discovery techniques assume a reasonably clean dataset and fail when faced with a dirty one; rules mined from dirty data are much less effective for error detection.
In the databases literature, a popular and expressive type of logic-based data quality rule (or Integrity Constraint) is the constant Conditional Functional Dependency (cCFD) [Fan et al., 2011], which can be easily understood by a data analyst.
We introduce a probabilistic model that combines error detection and rule induction (of cCFDs), and we show that this methodology outperforms traditional logic-based error detection alone. Moreover, after inference is performed, we obtain a set of rules that is statistically sound and has low redundancy. To the best of our knowledge, this is the first work to combine statistical anomaly detection with logic-based approaches to data cleaning.
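To make the notion of a cCFD concrete, here is an illustrative sketch (not taken from the paper): a constant Conditional Functional Dependency asserts that whenever the left-hand-side attributes take specific constant values, the right-hand-side attributes must take specific constant values too. The example rule and data below are hypothetical.

```python
def ccfd_violations(rows, lhs, rhs):
    """Return rows violating a cCFD given as constant patterns.

    lhs, rhs: dicts mapping attribute -> required constant value.
    A row violates the rule if it matches every LHS constant but
    disagrees with at least one RHS constant.
    """
    return [
        row for row in rows
        if all(row.get(a) == v for a, v in lhs.items())
        and any(row.get(a) != v for a, v in rhs.items())
    ]

# Hypothetical cCFD: ([country = 'UK'] -> [capital = 'London'])
table = [
    {"country": "UK", "capital": "London"},
    {"country": "UK", "capital": "Edinburgh"},  # violates the rule
    {"country": "France", "capital": "Paris"},  # LHS not matched: ignored
]

bad = ccfd_violations(table, lhs={"country": "UK"}, rhs={"capital": "London"})
```

Here `bad` contains only the dirty UK tuple, which is the kind of cell-level signal that error detection can consume.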
Original language: English
Title of host publication: NIPS 2016 Workshop on Artificial Intelligence for Data Science (AI4DataSci 2016)
Number of pages: 3
Publication status: Published - 10 Dec 2016
Event: NIPS 2016 Workshop on Artificial Intelligence for Data Science - Barcelona, Spain
Duration: 10 Dec 2016 → 10 Dec 2016
https://nips.cc/Conferences/2016/Schedule?type=Workshop&day=5

Conference

Conference: NIPS 2016 Workshop on Artificial Intelligence for Data Science
Abbreviated title: NIPS 2016
Country: Spain
City: Barcelona
Period: 10/12/16 → 10/12/16
Internet address: https://nips.cc/Conferences/2016/Schedule?type=Workshop&day=5

