Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling

Charles Sutton, Timothy Hobson, James Geddes, Rich Caruana

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Many analyses in data science are not one-off projects, but are repeated over multiple data samples, such as once per month, once per quarter, and so on. For example, if a data scientist performs an analysis in 2017 that saves a significant amount of money, then she will likely be asked to perform the same analysis on data from 2018. But more data analyses mean more effort spent in data wrangling. We introduce the data diff problem, which attempts to turn this problem into an opportunity. When the repeated data samples are compared against each other, inconsistencies may be indicative of underlying issues in data quality. By analogy to text diff, the data diff problem is to find a “patch”, that is, a transformation in a specified domain-specific language, that transforms the data samples so that they are identically distributed. We present a prototype tool for data diff that formalizes the problem as a bipartite matching problem, calibrating its parameters using a bootstrap procedure. The tool is evaluated quantitatively and through a case study on an open government data set.
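
To make the formulation in the abstract concrete, the following is a minimal sketch, not the authors' tool. It matches the columns of two data samples by distributional distance and uses a bootstrap to decide when a matched pair differs by more than sampling noise. Everything specific here is an assumption made for illustration: the Kolmogorov-Smirnov distance, the brute-force search over assignments (workable only for a handful of columns), the function names, and the toy data. The sketch only flags which matched column has drifted; it does not search for a patch in a domain-specific language.

```python
import itertools
import random


def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def match_columns(old, new):
    """Minimum-cost bipartite matching of old columns to new columns.

    Assumes both samples have the same number of columns and tries every
    assignment, which is only feasible for a few columns.
    """
    old_names, new_names = list(old), list(new)
    cost = {(a, b): ks_distance(old[a], new[b])
            for a in old_names for b in new_names}
    best = min(
        itertools.permutations(new_names),
        key=lambda perm: sum(cost[a, b] for a, b in zip(old_names, perm)),
    )
    return dict(zip(old_names, best)), cost


def bootstrap_threshold(values, n_boot=200, quantile=0.95, seed=0):
    """Distance expected between two resamples of the same column (upper quantile)."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        ks_distance(rng.choices(values, k=n), rng.choices(values, k=n))
        for _ in range(n_boot)
    )
    return stats[min(int(quantile * n_boot), n_boot - 1)]


# Hypothetical toy data: the second sample renames its columns and
# records height in inches instead of centimetres.
rng = random.Random(42)
old = {
    "height_cm": [rng.gauss(170, 10) for _ in range(300)],
    "weight_kg": [rng.gauss(70, 12) for _ in range(300)],
}
new = {
    "w": [rng.gauss(70, 12) for _ in range(300)],
    "h": [rng.gauss(67, 4) for _ in range(300)],
}

mapping, cost = match_columns(old, new)

# A matched pair is flagged when its distance exceeds what resampling
# the old column alone would produce.
flagged = [a for a, b in mapping.items()
           if cost[a, b] > bootstrap_threshold(old[a])]

print(mapping)
print(flagged)
```

In this toy setting the matching still pairs the height columns despite the unit change, and the bootstrap threshold marks that pair as inconsistent. A flagged pair of this kind is where a patch, such as a unit conversion, would be sought.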
Original language: English
Title of host publication: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Place of Publication: London, United Kingdom
Publisher: ACM
Pages: 2279-2288
Number of pages: 10
ISBN (Print): 978-1-4503-5552-0
Publication status: Published - 19 Jul 2018
Event: Knowledge Discovery and Data Mining Conference 2018 - ExCel London, London, United Kingdom
Duration: 19 Aug 2018 to 23 Aug 2018
http://www.kdd.org/kdd2018/

Conference

Conference: Knowledge Discovery and Data Mining Conference 2018
Abbreviated title: KDD 2018
Country: United Kingdom
City: London
Period: 19/08/18 to 23/08/18