Abstract
Vector databases have been widely used to implement similarity search over unstructured objects, e.g., documents and images. Each vector database is produced by an embedding model that encodes the objects such that more similar objects are embedded to closer vectors, allowing us to use top-k vector search as an implementation of top-k object similarity search. It is common practice for different vector databases to use distinct embedding models, so the same object may be encoded by different embedding vectors across databases. As a result, one cannot share and integrate vector databases to expand similarity search across datasets, a property we take for granted for relational databases. In this work, we attempt to break the barrier between different vector databases by developing an approach to integrating vector databases generated by different embedding models, without access to the encoded data objects or knowledge of the embedding models. Our approach is rooted in the *local isometry hypothesis*, a finding made via extensive experiments on real-life embedding vectors, and is backed up by theoretical analysis that bounds the quality of the integrated vector database. Experimental results show that we can integrate vector databases produced by various popular embedding models, e.g., NV-embed-V2, OpenAI Ada, GloVe, Mistral and FastText, while offering high recall of top-k similarity search over the integrated datasets.
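The barrier described in the abstract can be illustrated with a small toy sketch. It is not the paper's method: the helper `top_k`, the synthetic data, and the choice of a random orthogonal map as a stand-in for "a second embedding model" are all assumptions made purely for illustration. The sketch shows that top-k search gives the same neighbours inside each space, while a query vector taken from one space is not directly comparable to vectors stored in the other.

```python
import numpy as np

def top_k(query, vectors, k):
    """Return indices of the k vectors closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
n, d, k = 200, 16, 5

# Toy "model A" embeddings of n objects (synthetic, hypothetical data).
emb_a = rng.normal(size=(n, d))

# Toy "model B": the same objects under a random orthogonal map.
# Assumption for illustration only; real models differ in far more complex ways.
q_mat, _ = np.linalg.qr(rng.normal(size=(d, d)))
emb_b = emb_a @ q_mat

# Querying each database with its own embedding of object 0.
same_space_a = top_k(emb_a[0], emb_a, k)
same_space_b = top_k(emb_b[0], emb_b, k)

# Querying database B with the model-A vector of object 0:
# the coordinates live in different spaces, so this ranking is not meaningful.
cross_space = top_k(emb_a[0], emb_b, k)

agree_within = set(same_space_a) == set(same_space_b)
print("neighbours agree within each space:", agree_within)
print("cross-space result:", cross_space)
```

In this toy setting the two spaces are related by an exact distance-preserving map, which is much stronger than anything the abstract claims about real embedding models; it only serves to show why vectors from different models cannot simply be pooled into one index.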
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2026 ACM SIGMOD/PODS International Conference on Management of Data |
| Place of Publication | New York, NY, United States |
| Publisher | Association for Computing Machinery (ACM) |
| Pages | 1-28 |
| Number of pages | 28 |
| Publication status | Accepted/In press - 23 Aug 2025 |
| Event | The 2026 ACM SIGMOD/PODS International Conference on Management of Data - Bengaluru, India; Duration: 31 May 2026 → 5 Jun 2026; Conference number: 52; https://2026.sigmod.org |
Publication series
| Name | Proceedings of the ACM on Management of Data |
|---|---|
| Publisher | Association for Computing Machinery |
| Number | 6 |
| Volume | 3 |
| ISSN (Electronic) | 2836-6573 |
Conference
| Conference | The 2026 ACM SIGMOD/PODS International Conference on Management of Data |
|---|---|
| Abbreviated title | SIGMOD 2026 |
| Country/Territory | India |
| City | Bengaluru |
| Period | 31/05/26 → 5/06/26 |
| Internet address | https://2026.sigmod.org |
Keywords / Materials (for Non-textual outputs)
- vector databases
- data integration
- cross-model vector database integration
- embedding models
- unstructured data