Abstract
A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, including image retrieval, near-duplicate Web page detection, and machine learning. State-of-the-art approaches to answering such queries are mainly based on the pigeonhole principle to generate a set of candidates and then verify them.
We observe that the constraint based on the pigeonhole principle is not always tight and hence may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the partitions, and hence fail to exploit the data skewness to optimize the query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partition size and threshold. Based on the new principle, we first develop a tight constraint of candidates, and then devise cost-aware methods for dimension partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.
We observe that the constraint based on the pigeonhole principle is not always tight and hence may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the partitions, and hence fail to exploit the data skewness to optimize the query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partition size and threshold. Based on the new principle, we first develop a tight constraint of candidates, and then devise cost-aware methods for dimension partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.
| Original language | English |
|---|---|
| Title of host publication | 34th IEEE International Conference on Data Engineering |
| Place of Publication | Paris, France |
| Publisher | Institute of Electrical and Electronics Engineers |
| Pages | 1-14 |
| Number of pages | 14 |
| ISBN (Electronic) | 978-1-5386-5520-7 |
| ISBN (Print) | 978-1-5386-5521-4 |
| DOIs | |
| Publication status | Published - 25 Oct 2018 |
| Event | 34th IEEE International Conference on Data Engineering - Paris, France Duration: 16 Apr 2018 → 19 Apr 2018 https://icde2018.org/ |
Publication series
| Name | |
|---|---|
| Publisher | IEEE |
| ISSN (Print) | 1063-6382 |
| ISSN (Electronic) | 2375-026X |
Conference
| Conference | 34th IEEE International Conference on Data Engineering |
|---|---|
| Abbreviated title | ICDE 2018 |
| Country/Territory | France |
| City | Paris |
| Period | 16/04/18 → 19/04/18 |
| Internet address |