Generalizing the Pigeonhole Principle for Similarity Query Processing in Hamming Space

Jianbin Qin, Chuan Xiao, Yaoshu Wang, Wei Wang, Xuemin Lin, Yoshiharu Ishikawa, Guoren Wang

Research output: Contribution to journalArticlepeer-review


A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, such as image retrieval, near-duplicateWeb page detection, and scientific databases. State-of-the-art approaches to answering such queries are mainly based on the pigeonhole principle to generate a set of candidates and then verify them. We observe that the constraint by the pigeonhole principle is not always tight and may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the parts, hence failing to exploit the data skewness to optimize query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partitioning and threshold allocation. Based on the new principle, we develop a tight constraint of candidates and devise cost-aware methods for partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.
Original languageEnglish
Pages (from-to)489 - 505
Number of pages16
JournalIEEE Transactions on Knowledge and Data Engineering
Issue number2
Early online date14 Feb 2019
Publication statusPublished - 1 Feb 2021


  • Hamming distance
  • Query processing
  • Search problems
  • Partitioning algorithms
  • Resource management
  • Indexes
  • similarity search
  • pigeonhole principle


Dive into the research topics of 'Generalizing the Pigeonhole Principle for Similarity Query Processing in Hamming Space'. Together they form a unique fingerprint.

Cite this