
Integrating vector databases across embedding models

Beining Yang, Yang Cao, Yang Ren

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Vector databases have been widely used to implement similarity search over unstructured objects, e.g., documents and images. Each vector database is produced by an embedding model that encodes the objects so that more similar objects are embedded to closer vectors, allowing us to use top-k vector search as an implementation of top-k object similarity search. It is common practice that different vector databases use distinct embedding models, so the same object may be encoded by different embedding vectors across databases. As a result, one cannot share and integrate vector databases to expand similarity search across datasets, a property we take for granted for relational databases. In this work, we attempt to break the barrier between different vector databases by developing an approach to integrating vector databases generated by different embedding models, with neither access to the encoded data objects nor knowledge of the embedding models. Our approach is rooted in the local isometry hypothesis, a finding made via extensive experiments on real-life embedding vectors, and is backed up by theoretical analysis that bounds the quality of the integrated vector database. Experimental results show that we can integrate vector databases produced by various popular embedding models, e.g., NV-embed-V2, OpenAI Ada, GloVe, Mistral and FastText, while offering high recall of top-k similarity search over the integrated datasets.
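
The barrier described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the five 2-D points, the helper names `top_k` and `rotate90`, and the choice of a single global 90-degree rotation as the stand-in for "a different embedding model" are all assumptions made for illustration. The abstract's local isometry hypothesis is not spelled out in this record, so a global isometry is used here only as the simplest analogue.

```python
import math

def top_k(query, vectors, k):
    """Return indices of the k vectors closest to `query` (Euclidean distance)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

def rotate90(v):
    """Exact 90-degree rotation of a 2-D vector; it preserves all distances."""
    return (-v[1], v[0])

# Hypothetical "model A" embeddings of five objects (toy data, not real embeddings).
db_a = [(3, 0), (4, 0), (0, 3), (0, 4), (1, -3)]
# Hypothetical "model B": the same five objects in a rotated embedding space.
db_b = [rotate90(v) for v in db_a]

# Searching inside either database gives the same neighbours for object 0,
# because the rotation keeps every pairwise distance unchanged.
within_a = top_k(db_a[0], db_a, 2)
within_b = top_k(db_b[0], db_b, 2)

# Using a model-A vector directly as a query against database B mixes two
# coordinate systems, so the neighbours are no longer meaningful.
cross = top_k(db_a[0], db_b, 2)

# If the distance-preserving map between the spaces were known, mapping the
# query first would restore the original neighbours.
aligned = top_k(rotate90(db_a[0]), db_b, 2)

print(within_a, within_b, cross, aligned)
```

In this toy setting the map between spaces is known by construction. The abstract states that the actual approach works without access to the encoded objects or knowledge of the embedding models, so recovering a suitable mapping is the hard part that this sketch does not attempt.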
Original language: English
Title of host publication: Proceedings of the 2026 ACM SIGMOD/PODS International Conference on Management of Data
Place of Publication: New York, NY, United States
Publisher: Association for Computing Machinery (ACM)
Pages: 1-28
Number of pages: 28
Publication status: Accepted/In press - 23 Aug 2025
Event: The 2026 ACM SIGMOD/PODS International Conference on Management of Data - Bengaluru, India
Duration: 31 May 2026 to 5 Jun 2026
Conference number: 52
https://2026.sigmod.org

Publication series

Name: Proceedings of the ACM on Management of Data
Publisher: Association for Computing Machinery
Number: 6
Volume: 3
ISSN (Electronic): 2836-6573

Conference

Conference: The 2026 ACM SIGMOD/PODS International Conference on Management of Data
Abbreviated title: SIGMOD 2026
Country/Territory: India
City: Bengaluru
Period: 31/05/26 to 5/06/26

Keywords / Materials (for Non-textual outputs)

  • vector databases
  • data integration
  • cross-model vector database integration
  • embedding models
  • unstructured data
