Edinburgh Research Explorer

Pattern Functional Dependencies for Data Cleaning

Research output: Contribution to journalArticle

  • Abdulhakim Qahtan
  • Nan Tang
  • Mourad Ouzzani
  • Yang Cao
  • Michael Stonebraker

Related Edinburgh Organisations

Open Access permissions

Open

Documents

  • Download as Adobe PDF

    Accepted author manuscript, 564 KB, PDF document

    Licence: Creative Commons: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)

  • Download as Adobe PDF

    Final published version, 612 KB, PDF document

    Licence: Creative Commons: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)

http://www.vldb.org/pvldb/vol13.html
Original languageEnglish
Pages (from-to)684-697
Number of pages14
JournalProceedings of the VLDB Endowment (PVLDB)
Volume13
Issue number5
DOIs
Publication statusPublished - 31 Jan 2020
Event46th International Conference on Very Large Data Bases - Tokyo, Japan
Duration: 31 Aug 20204 Sep 2020
Conference number: 46
https://vldb2020.org/

Abstract

Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like “1980-” might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as “F-9-107”, “F” is sufficient to determine the finance department.
Inspired by the above observation, we propose a novel class ofICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong’s axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types ofICs

Event

46th International Conference on Very Large Data Bases

31/08/204/09/20

Tokyo, Japan

Event: Conference

ID: 128877362