GPH: Similarity Search in Hamming Space

Jianbin Qin, Yaoshu Wang, Chuan Xiao, Wei Wang, Xuemin Lin, Yoshiharu Ishikawa

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, including image retrieval, near-duplicate Web page detection, and machine learning. State-of-the-art approaches to answering such queries are mainly based on the pigeonhole principle to generate a set of candidates and then verify them.
We observe that the constraint based on the pigeonhole principle is not always tight and hence may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the partitions, and hence fail to exploit the data skewness to optimize the query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partition size and threshold. Based on the new principle, we first develop a tight constraint of candidates, and then devise cost-aware methods for dimension partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.
Original languageEnglish
Title of host publication34th IEEE International Conference on Data Engineering
Place of PublicationParis, France
PublisherInstitute of Electrical and Electronics Engineers
Pages1-14
Number of pages14
ISBN (Electronic)978-1-5386-5520-7
ISBN (Print)978-1-5386-5521-4
DOIs
Publication statusPublished - 25 Oct 2018
Event34th IEEE International Conference on Data Engineering - Paris, France
Duration: 16 Apr 201819 Apr 2018
https://icde2018.org/

Publication series

Name
PublisherIEEE
ISSN (Print)1063-6382
ISSN (Electronic)2375-026X

Conference

Conference34th IEEE International Conference on Data Engineering
Abbreviated titleICDE 2018
Country/TerritoryFrance
CityParis
Period16/04/1819/04/18
Internet address

Fingerprint

Dive into the research topics of 'GPH: Similarity Search in Hamming Space'. Together they form a unique fingerprint.

Cite this