Unsupervised deduplication using cross-field dependencies

Rob Hall, Charles Sutton, Andrew McCallum

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent -- because venues tend to focus on a few research areas -- but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance -- a 58% reduction in error over a standard Dirichlet process mixture.
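As a rough illustration of the idea described in the abstract, the Python sketch below scores a citation record against candidate clusters by adding a Dirichlet-multinomial predictive score for the title words to an edit-distance score for the venue string, so that one cluster choice governs both fields. This is not the authors' model: the function names (`title_log_score`, `venue_log_score`, `best_cluster`), the hyperparameters `beta` and `penalty`, and the toy clusters are all hypothetical, the plain Levenshtein penalty is a simple stand-in for the paper's non-exchangeable string-edit model, and the sketch performs no inference over the latent variables.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def title_log_score(words, cluster_counts, vocab_size, beta=0.1):
    """Sequential Dirichlet-multinomial predictive log-probability of the
    title words, given the word counts already assigned to a cluster.
    `beta` is an assumed symmetric Dirichlet hyperparameter."""
    counts = Counter(cluster_counts)
    total = sum(counts.values())
    score = 0.0
    for w in words:
        score += math.log((counts[w] + beta) / (total + beta * vocab_size))
        counts[w] += 1
        total += 1
    return score


def venue_log_score(venue, canonical, penalty=1.0):
    """Unnormalized log-score: each character edit away from the cluster's
    canonical venue string costs `penalty` (an assumed stand-in model)."""
    return -penalty * edit_distance(venue.lower(), canonical.lower())


def best_cluster(record, clusters, vocab_size):
    """Pick the single cluster that best explains BOTH fields of a record."""
    scores = {
        name: title_log_score(record["title"], c["title_counts"], vocab_size)
        + venue_log_score(record["venue"], c["canonical_venue"])
        for name, c in clusters.items()
    }
    return max(scores, key=scores.get), scores


# Toy data, invented purely for illustration.
clusters = {
    "kdd": {"canonical_venue": "KDD",
            "title_counts": {"mining": 5, "clustering": 3, "data": 4}},
    "icml": {"canonical_venue": "ICML",
             "title_counts": {"learning": 6, "kernel": 2, "bayesian": 3}},
}
record = {"title": ["data", "mining"], "venue": "Proc. KDD"}
best, scores = best_cluster(record, clusters, vocab_size=1000)
print(best, scores)
```

Because the two field scores are summed under a shared cluster label, evidence from the title can tip an ambiguous venue string toward the right cluster and vice versa, which is the cross-field dependence the abstract refers to.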
Original language: English
Title of host publication: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08)
Place of Publication: New York, NY, USA
Publisher: ACM
Pages: 310-317
Number of pages: 8
ISBN (Print): 978-1-60558-193-4
Publication status: Published - 2008

Keywords

  • data mining
  • deduplication
  • dirichlet process mixture
  • information extraction
