OKHK 👀

11:45 · November 3, 2024 · Sunday

https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction/

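Generally, a vector database is positioned as an auxiliary component that stores the text embeddings generated from the core data. But when the core data changes, the vector store (including its metadata) has to be updated too. That means a consistency-maintenance burden, and the system grows more complex and error-prone over time. I have felt this keenly while building RAG applications.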

"You're building a RAG system, and your team uses Pinecone as a vector database to store and search embeddings. But you can't just use Pinecone—your text data doesn't fit well into Pinecone's metadata, so you're also using DynamoDB to handle those blobs and application data. And for lexical search, you needed OpenSearch. Now you're juggling three systems, and syncing them is a nightmare."

Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created, rather than what they truly are: derived data. By treating embeddings as independent data, we’ve created unnecessary complexity for ourselves.

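The article takes aim at the original sin of vector databases and explains the cause thoroughly: vector databases store embeddings as independent data, when they are really derived data that should live next to the core data, with the database itself responsible for keeping them updated and consistent.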

In this post, we'll propose a better way: treating embeddings more like database indexes through what we call the **"vectorizer"** abstraction. This approach automatically keeps embeddings in sync with their source data, eliminating the maintenance costs that plague current implementations.

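The article proposes the concept of vectorizers. Like an index, a vectorizer, once created, automatically maintains the embeddings for a column in a table, with no manual create/update/delete (C/U/D) needed. The authors built a tool called pgai that brings vectorizer functionality to PostgreSQL. I think this design philosophy is the future of vector storage, and I hope pgai stabilizes and gains adoption soon, inspiring more databases to build something similar.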
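To make the "embeddings as derived data" idea concrete, here is a minimal sketch of the pattern in Python with SQLite. It is not pgai's API and makes no claim about how pgai is implemented. The table names (`docs`, `docs_embedding`, `embed_queue`), the functions `create_vectorizer` and `run_worker`, and the hash-based `toy_embed` are all hypothetical stand-ins chosen for illustration. The assumption is simply that triggers record every change to the source table and a worker later brings the embeddings back in sync, so application code never writes embeddings by hand.

```python
import hashlib
import sqlite3


def toy_embed(text: str, dim: int = 4) -> str:
    """Hypothetical stand-in for a real embedding model (deterministic, hash-based)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return ",".join(f"{b / 255:.3f}" for b in digest[:dim])


def create_vectorizer(conn: sqlite3.Connection) -> None:
    """Create the source table, the derived embedding table, and change-tracking triggers."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS docs (
            id INTEGER PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS docs_embedding (
            doc_id INTEGER PRIMARY KEY, embedding TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS embed_queue (doc_id INTEGER NOT NULL);

        -- Inserts and body updates are queued for (re)embedding.
        CREATE TRIGGER IF NOT EXISTS docs_ai AFTER INSERT ON docs
        BEGIN INSERT INTO embed_queue VALUES (NEW.id); END;

        CREATE TRIGGER IF NOT EXISTS docs_au AFTER UPDATE OF body ON docs
        BEGIN INSERT INTO embed_queue VALUES (NEW.id); END;

        -- Deletes remove the derived row (and any pending work) immediately.
        CREATE TRIGGER IF NOT EXISTS docs_ad AFTER DELETE ON docs
        BEGIN
            DELETE FROM docs_embedding WHERE doc_id = OLD.id;
            DELETE FROM embed_queue WHERE doc_id = OLD.id;
        END;
        """
    )


def run_worker(conn: sqlite3.Connection) -> int:
    """Drain the queue and recompute embeddings; returns how many rows were embedded."""
    ids = [r[0] for r in conn.execute("SELECT DISTINCT doc_id FROM embed_queue")]
    embedded = 0
    for doc_id in ids:
        row = conn.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
        if row is not None:
            conn.execute(
                "INSERT OR REPLACE INTO docs_embedding (doc_id, embedding) VALUES (?, ?)",
                (doc_id, toy_embed(row[0])),
            )
            embedded += 1
    conn.execute("DELETE FROM embed_queue")
    conn.commit()
    return embedded


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_vectorizer(conn)
    conn.execute("INSERT INTO docs (body) VALUES ('hello')")
    conn.execute("INSERT INTO docs (body) VALUES ('world')")
    run_worker(conn)
    conn.execute("UPDATE docs SET body = 'hello again' WHERE id = 1")
    conn.execute("DELETE FROM docs WHERE id = 2")
    run_worker(conn)
    print(conn.execute("SELECT doc_id, embedding FROM docs_embedding").fetchall())
```

The application only ever touches `docs`; the embedding table follows along, which is the index-like behavior the article argues for. In this sketch the embeddings are eventually consistent (they lag until the worker runs), a trade-off a real system would need to decide on deliberately.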

Tiger Data Blog

Vector Databases Are the Wrong Abstraction

Today’s vector databases disconnect embeddings from their source data. We should treat embeddings more like database indexes—here’s how.