
Everything You Need to Know About Vector Databases


✏️ Editor's note:

In today's world of exponentially growing data volumes and increasingly diverse data types, scalar data storage can no longer keep up with rapidly evolving data scenarios. How should we store and manage unstructured data such as images, videos, and text? And how do vector databases handle complex scenarios like recommender systems, semantic understanding, drug discovery, and stock market analysis?

🤵‍♂️ About the author:

Frank Liu is an AI technical operations and product expert at Zilliz. He holds a master's degree in electrical engineering from Stanford University, previously worked at Yahoo in the US, and later returned to China to found a startup; he has many years of experience in AI algorithm research and applications. At Zilliz, Frank leads technical and product operations for the Towhee project, building an open-source ecosystem around the community and driving user adoption and real-world implementation.

Relational is not Enough

Data is everywhere. In the early days of the internet, data was mostly structured and could easily be stored and managed in relational databases. Take, for example, a book database: each row represents a particular book, while each column corresponds to a particular category of information (title, author, ISBN, and so on).

Storing and searching across table-based data like this is exactly what relational databases were designed to do. When users look up books through an online service, they can query any of the columns present within the database. For example, querying for all results where the author is Bill Bryson returns all of Bryson's books.
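To make this concrete, here is a minimal sketch of such a lookup using Python's built-in sqlite3 module. The table contents (ISBNs, years, the non-Bryson title) are illustrative stand-ins rather than data from the original example:

import sqlite3

# Build a small in-memory relational database with a books table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE books (isbn TEXT, title TEXT, author TEXT, year INTEGER)')
conn.executemany('INSERT INTO books VALUES (?, ?, ?, ?)', [
    ('000-0000000001', 'A Walk in the Woods', 'Bill Bryson', 1998),  # placeholder ISBNs
    ('000-0000000002', 'A Short History of Nearly Everything', 'Bill Bryson', 2003),
    ('000-0000000003', 'The Surgeon of Crowthorne', 'Simon Winchester', 1998),
])

# Structured queries work over any column - here, all of Bryson's books.
for (title,) in conn.execute('SELECT title FROM books WHERE author = ?', ('Bill Bryson',)):
    print(title)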

As the internet grew and evolved, unstructured data (magazine articles, shared photos, short videos, etc.) became increasingly common. Unlike structured data, there is no easy way to store the contents of unstructured data within a relational database. Imagine, for example, trying to search for similar shoes given a collection of shoe pictures taken from various angles; this is impossible in a relational database, since shoe style, size, color, and so on cannot be derived purely from an image's raw pixel values.

X2vec: A New Way to Understand Data

This brings us to vector databases. The increasing ubiquity of unstructured data has led to a steady rise in the use of machine learning models trained to understand such data. Word2vec, a natural language processing (NLP) algorithm which uses a neural network to learn word associations, is a well-known early example of this. The word2vec model is capable of turning single words (in a variety of languages, not just English) into a list of floating point values, or vectors. Due to the way the machine learning model is trained, vectors which are close to each other represent words which are similar to each other, hence the term embedding vectors. We'll get into a bit more detail (with code!) in the next section.

The idea of turning a piece of unstructured data into a list of numerical values is nothing new*. As deep learning gained steam in both academic and industry circles, new ways to represent text, audio, and images came to be. A common component of all these representations is their use of embedding vectors generated by trained deep neural networks. Going back to the example of word2vec, we can see that the generated embeddings contain significant semantic information.

* Early computer vision and image processing relied on local feature descriptors to turn an image into a “bag” of embedding vectors – one vector for each detected keypoint. SIFT, SURF, and ORB are three well-known feature descriptors you may have heard of. These hand-crafted descriptors, while useful for matching images with one another, proved to be a fairly poor way to represent the overall content of audio (via spectrograms) and images.
Example: Apple, the company, the fruit, ... or both?

The word "apple" can refer to both the company as well as the delicious red fruit. In this example, we can see that Word2Vec retains both meanings.

[('droid_x', 0.6324754953384399)]
[('apple', 0.6410146951675415)]

"Droid" refers to Samsung's first 4G LTE smartphone ("Samsung" + "iPhone" - "Apple" = "Droid"), while "apple" is the 10th closest word to "fruit".

While there are newer and better deep learning algorithms/models for generating word embeddings (ELMo, GPT-2, and BERT, to name a few), the concept remains the same.

Vectors generated from multilayer neural networks have enough high-level information to be applicable for a variety of tasks.

Vector embeddings are not just limited to natural language. In the example below, we use the towhee library (https://github.com/towhee-io/towhee) to generate embedding vectors for three different images, two of which have similar content:

Generating embeddings

Now let's use towhee to generate embeddings for our images.
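The following is a sketch based on the pipeline API that towhee exposed around the time this article was written; pipeline names may differ in later versions, and the image file paths are placeholders for the two dog photos and the car photo:

from towhee import pipeline

# Load a prebuilt image-embedding pipeline from the Towhee hub.
embedding_pipeline = pipeline('image-embedding')

# Generate one embedding vector per image.
dog0_vec = embedding_pipeline('dog0.jpg')
dog1_vec = embedding_pipeline('dog1.jpg')
car_vec = embedding_pipeline('car.jpg')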

Now let's compute distances
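A short NumPy sketch of the distance computation, assuming the three embeddings from the previous step (the exact values depend on the underlying model):

import numpy as np

# L2 (Euclidean) distance between embeddings; smaller means more similar.
dog0, dog1, car = (np.asarray(v).squeeze() for v in (dog0_vec, dog1_vec, car_vec))
print('dog0 to dog1 distance:', np.linalg.norm(dog0 - dog1))
print('dog0 to car distance:', np.linalg.norm(dog0 - car))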

dog0 to dog1 distance: 0.59794164
dog0 to car distance: 1.1380062

Searching Across Vectors

Now that we’ve seen the representational power of vector embeddings, let’s take a bit of time to briefly discuss searching across them. Like relational databases, vector databases need to be searchable in order to be truly useful – just storing the vectors and their associated metadata is not enough. Finding the stored vectors closest to a query vector is called nearest neighbor search, or NN search for short, and the sheer number of solutions proposed for it means it can be considered a subfield of machine learning and pattern recognition in its own right.

Vector search is generally split into two components - the similarity metric and the index. The similarity metric defines how the distance between two vectors is evaluated; the most common choice is the L2 norm of the difference between the two vectors (also known as Euclidean distance), where a smaller distance means greater similarity. The index, on the other hand, is a data structure that facilitates the search process, and a diverse set of indexes exists, each with its own advantages and disadvantages. We won't go into the details of vector indexes here (that's a topic for another article) - just know that without them, a single query vector would need to be compared with every other vector in the database, making the query process excruciatingly slow, as the sketch below illustrates.
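Here is a minimal NumPy sketch (random vectors, not from the original post) of that index-free alternative - a brute-force linear scan that compares the query against every stored vector:

import numpy as np

def brute_force_search(query, vectors, k=5):
    # Compare the query against every stored vector: O(N * d) per query.
    # Vector indexes exist precisely to avoid this exhaustive scan.
    dists = np.linalg.norm(vectors - query, axis=1)  # L2 distance to each vector
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    return nearest, dists[nearest]

# 100,000 random 128-dimensional vectors stand in for a real collection.
vectors = np.random.random((100_000, 128)).astype(np.float32)
query = np.random.random(128).astype(np.float32)
ids, dists = brute_force_search(query, vectors, k=3)
print(ids, dists)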

Putting It All Together

Now that we understand the representational power of embedding vectors and have a good general overview of how vector search works, it’s time to put the two concepts together – welcome to the world of vector databases. Vector databases are purpose-built to store, index, and query across embedding vectors generated by passing unstructured data through machine learning models.

When scaling to huge numbers of vector embeddings, searching across embedding vectors (even with indices) can be prohibitively expensive. Despite this, the best and most advanced vector databases will allow you to insert and search across millions or even billions of target vectors, in addition to specifying an indexing algorithm and similarity metric of your choosing.

Like production-ready relational databases, vector databases must meet a few key performance targets before they can be deployed in real production environments:

  1. Scalable: Embedding vectors are fairly small in terms of absolute memory, but to facilitate read and write speeds, they are usually stored in-memory (disk-based NN/ANN search is a topic for another blog post). When scaling to billions of embedding vectors and beyond, storage and compute quickly become unmanageable for a single machine. Sharding can solve this problem, but this requires splitting the indexes across multiple machines as well.
  2. Reliable: Modern relational databases are fault-tolerant. Replication allows cloud native enterprise databases to avoid having single points of failure, enabling graceful startup and shutdown. Vector databases are no different, and should be able to handle internal faults without data loss and with minimal operational impact.
  3. Fast: Yes, query and write speeds are important, even for vector databases. An increasingly common use case is processing and indexing database inputs in real time. For platforms such as Snapchat and Instagram, which can see hundreds or thousands of new photos (a type of unstructured data) uploaded per second, speed becomes an incredibly important factor.

With data being generated at unprecedented rates, making sense of all the data through vector databases will become increasingly important.

The World’s Most Advanced Vector Database

Milvus, an open-source vector database, is a leader in this space. Milvus provides a number of demos that you can use to evaluate the capabilities and use cases of vector databases. With the release of Milvus 2.0 GA, Milvus is now a cloud-native, fault-tolerant system capable of scaling to billions of vectors and beyond. Setup is done via a simple docker command, while inserts and queries can be performed through our Python, Go, Node.js, or Java bindings. For more information, we welcome you to visit us at milvus.io.
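To give a flavor of what this looks like in practice, below is a minimal sketch using the Python bindings from around the Milvus 2.0 release, run against a local Milvus instance; the collection name, vector dimension, and index parameters are illustrative choices rather than recommendations:

import numpy as np
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Connect to a Milvus instance started via docker.
connections.connect(host='localhost', port='19530')

# Define a collection holding 128-dimensional float vectors.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection('demo_vectors', CollectionSchema(fields))

# Insert 1,000 random vectors standing in for real embeddings.
collection.insert([np.random.random((1000, 128)).tolist()])

# Pick an indexing algorithm and similarity metric of your choosing.
collection.create_index('embedding', {
    'index_type': 'IVF_FLAT',
    'metric_type': 'L2',
    'params': {'nlist': 128},
})
collection.load()

# Search for the 3 nearest neighbors of a random query vector.
results = collection.search(
    data=[np.random.random(128).tolist()],
    anns_field='embedding',
    param={'metric_type': 'L2', 'params': {'nprobe': 8}},
    limit=3,
)
print(results[0].ids, results[0].distances)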

We hope the information in this post is useful for you. We’ll keep the posts coming on a regular basis, so feel free to come back for more material on vector databases, unstructured data, or AI/ML in general.
