VChunkJoin: An Efficient Algorithm for Edit Similarity Joins

Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen

Research output: Contribution to journalArticlepeer-review

Abstract

Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
Original languageEnglish
Pages (from-to)1916-1929
Number of pages14
JournalIEEE Transactions on Knowledge and Data Engineering
Volume25
Issue number8
Early online date17 Apr 2012
DOIs
Publication statusPublished - Aug 2013

Fingerprint Dive into the research topics of 'VChunkJoin: An Efficient Algorithm for Edit Similarity Joins'. Together they form a unique fingerprint.

Cite this