Local Similarity Search for Unstructured Text

Pei Wang, Chuan Xiao, Jianbin Qin, Wei Wang, Xiaoyang Zhang, Yoshiharu Ishikawa

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop the techniques to support the large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.
Original languageEnglish
Title of host publicationProceedings of the 2016 International Conference on Management of Data
Place of PublicationSan Francisco, California, USA
PublisherACM
Pages1991-2005
Number of pages15
ISBN (Electronic)978-1-4503-3531-7
DOIs
Publication statusPublished - 26 Jun 2016
Event2016 International Conference on Management of Data - San Francisco, United States
Duration: 26 Jun 20161 Jul 2016
http://sigmod2016.org/

Conference

Conference2016 International Conference on Management of Data
Abbreviated titleSIGMOD/PODS'16
Country/TerritoryUnited States
CitySan Francisco
Period26/06/161/07/16
Internet address

Fingerprint

Dive into the research topics of 'Local Similarity Search for Unstructured Text'. Together they form a unique fingerprint.

Cite this