Using the Embedding Feature

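Tencent Cloud VectorDB enables the Embedding feature by default. To apply Embedding when writing, updating, or searching data, you must specify an Embedding model and configure its parameters when you create the Collection.
Specify an Embedding Model When Creating a Collection
The following example request calls /collection/create to create the Collection book-emb. In the embedding parameter, field names the field that holds the raw text (text), vectorField names the field that stores the vector generated from that text, and model selects the Embedding model. For more information, see /collection/create.
Important:
The example below can be copied directly. Before running it on a CVM instance, use a text editor to replace api_key=A5VOgsMpGWJhUI0WmUbY******************** and 10.0.X.X with your actual values.
When you configure Embedding parameters and omit the dimension parameter in indexes, dimension is set automatically to the vector dimension of the Embedding model. If the dimension you configure does not match the model's vector dimension, an error is returned. For the vector dimension of each Embedding model, see the model information.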
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/collection/create \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "replicaNum": 2,
    "shardNum": 1,
    "description": "this is the collection description",
    "embedding": {
        "field": "text",
        "vectorField": "vector",
        "model": "bge-base-zh"
    },
    "indexes": [
        {
            "fieldName": "id",
            "fieldType": "string",
            "indexType": "primaryKey"
        },
        {
            "fieldName": "vector",
            "fieldType": "vector",
            "indexType": "HNSW",
            "metricType": "COSINE",
            "params": {
                "M": 16,
                "efConstruction": 200
            }
        },
        {
            "fieldName": "bookName",
            "fieldType": "string",
            "indexType": "filter"
        },
        {
            "fieldName": "author",
            "fieldType": "string",
            "indexType": "filter"
        }
    ]
}'
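The dimension rule in the note above can be sketched in Python. This is a minimal illustration, not the service's implementation: `MODEL_DIMENSIONS` and `resolve_dimension` are hypothetical names, and the value 768 for bge-base-zh is an assumption taken from the public model card rather than from this page, so check the model information before relying on it.

```python
# Hypothetical lookup table; 768 for bge-base-zh is an assumption,
# confirm it against the model information.
MODEL_DIMENSIONS = {"bge-base-zh": 768}


def resolve_dimension(model, configured=None):
    """Return the dimension the vector index would end up with.

    Omitting `configured` falls back to the model's dimension; a mismatch
    raises, mirroring the error described in the note.
    """
    expected = MODEL_DIMENSIONS[model]
    if configured is None:
        return expected
    if configured != expected:
        raise ValueError(
            f"dimension {configured} does not match {model} ({expected})"
        )
    return configured


if __name__ == "__main__":
    print(resolve_dimension("bge-base-zh"))       # dimension omitted
    print(resolve_dimension("bge-base-zh", 768))  # explicit and consistent
```

Running a check like this on the client side before calling /collection/create surfaces a mismatch without a round trip to the server.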
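Write Text Data
Use /document/upsert to batch-insert data into the Collection book-emb in the database db-test. In the example below, the raw text is passed in the text field, which is the field name set by the field parameter of embedding when the Collection was created.
Note:
If you are not sure whether the Collection has an Embedding model configured, call /collection/describe before writing data and check whether the status parameter of embedding is enabled.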
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/upsert \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "buildIndex": true,
    "documents": [
        {
            "id": "0001",
            "text": "话说天下大势，分久必合，合久必分。",
            "author": "罗贯中",
            "bookName": "三国演义",
            "page": 21
        },
        {
            "id": "0002",
            "text": "混沌未分天地乱，茫茫渺渺无人间。",
            "author": "吴承恩",
            "bookName": "西游记",
            "page": 22
        },
        {
            "id": "0003",
            "text": "甄士隐梦幻识通灵，贾雨村风尘怀闺秀。",
            "author": "曹雪芹",
            "bookName": "红楼梦",
            "page": 23
        }
    ]
}'
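The pre-write check mentioned in the note earlier in this section can be sketched as follows. The response layout used here (an `embedding` object nested under `collection`) is an assumption made for illustration and is not confirmed by this page; `embedding_enabled` and `sample_response` are hypothetical names. Adjust the keys to match what /collection/describe actually returns in your environment.

```python
def embedding_enabled(describe_response):
    """Return True when the response reports embedding.status == "enabled"."""
    collection = describe_response.get("collection") or {}
    embedding = collection.get("embedding") or {}
    return embedding.get("status") == "enabled"


# Assumed response shape, trimmed to the fields this check reads.
sample_response = {
    "code": 0,
    "msg": "operation success",
    "collection": {
        "collection": "book-emb",
        "embedding": {
            "field": "text",
            "vectorField": "vector",
            "model": "bge-base-zh",
            "status": "enabled",
        },
    },
}

if __name__ == "__main__":
    print(embedding_enabled(sample_response))
```

The helper treats a missing `embedding` object the same as a disabled one, so a Collection created without an Embedding model is reported as not ready.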
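Note:
While data is being written, you can log in to the VectorDB console and open the Embedding tab to view the token processing rate and monitor token trends in real time. For details, see Managing the Embedding Feature.
Search Data
The following example uses the /document/search API to search the Collection book-emb for documents that are similar to the text in the embeddingItems parameter and that satisfy the filter expression bookName in ("三国演义","西游记").
ef is a parameter of the HNSW index type. It sets how far the search traverses when looking for a node's neighbors and defaults to 200. A larger ef gives higher recall.
outputFields specifies the fields to return.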
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/search \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "search": {
        "embeddingItems": [
            "天下大势，分久必合，合久必分"
        ],
        "limit": 3,
        "params": {
            "ef": 200
        },
        "retrieveVector": false,
        "filter": "bookName in (\"三国演义\",\"西游记\")",
        "outputFields": [
            "id",
            "author",
            "text",
            "bookName"
        ]
    }
}'
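For readers who prefer to assemble the request in code, the sketch below builds the same search body with Python's standard library and lets `json.dumps` handle the quote escaping inside the filter string. `build_in_filter` and `build_search_request` are hypothetical helper names, and the address and key in the demo are placeholders. The request is only prepared, not sent.

```python
import json
import urllib.request


def build_in_filter(field, values):
    """Build a filter such as: bookName in ("a","b")."""
    quoted = ",".join(f'"{v}"' for v in values)
    return f"{field} in ({quoted})"


def build_search_request(base_url, api_key, text, book_names):
    """Prepare (but do not send) the POST request for /document/search."""
    payload = {
        "database": "db-test",
        "collection": "book-emb",
        "search": {
            "embeddingItems": [text],
            "limit": 3,
            "params": {"ef": 200},
            "retrieveVector": False,
            "filter": build_in_filter("bookName", book_names),
            "outputFields": ["id", "author", "text", "bookName"],
        },
    }
    return urllib.request.Request(
        f"{base_url}/document/search",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer account=root&api_key={api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder address and key; replace both before sending.
    req = build_search_request(
        "http://10.0.X.X:80", "YOUR_API_KEY",
        "天下大势，分久必合，合久必分", ["三国演义", "西游记"],
    )
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would send the request.
```

Building the filter from a list keeps the quoting in one place, which avoids hand-escaping mistakes when the values change.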
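The search results are shown below. score is the similarity score, computed with COSINE; a larger value means greater similarity. text is the field name defined for the raw text when the Collection was created, and it stores the original text.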
{
    "code": 0,
    "msg": "operation success",
    "documents": [
        [
            {
                "id": "0001",
                "score": 0.9792741537094116,
                "bookName": "三国演义",
                "author": "罗贯中",
                "text": "话说天下大势，分久必合，合久必分。"
            },
            {
                "id": "0002",
                "score": 0.7909858226776123,
                "bookName": "西游记",
                "author": "吴承恩",
                "text": "混沌未分天地乱，茫茫渺渺无人间。"
            }
        ]
    ]
}
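To make the score field more concrete, here is a small sketch of cosine similarity in plain Python. It uses toy three-element vectors instead of real embeddings, and it shows the general formula only; how the service computes or normalizes its score internally is not described on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0]  # toy stand-in for a query embedding
    print(cosine_similarity(query, [2.0, 4.0, 6.0]))   # same direction
    print(cosine_similarity(query, [3.0, -1.5, 0.0]))  # orthogonal
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why the near-verbatim quote in document 0001 ranks above document 0002.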
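Update Data
Use the /document/update API to update data. The following example selects Documents with documentIds and a filter expression, replaces the text in their text field, sets page to 30, and adds the new field test_new_field. The vector data in the vector field is updated automatically.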
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/update \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "query": {
        "documentIds": [
            "0001",
            "0003"
        ],
        "filter": "bookName in (\"三国演义\",\"西游记\")"
    },
    "update": {
        "text": "合久必分，分久必合",
        "page": 30,
        "test_new_field": "new field value"
    }
}'
On success, the following response is returned.
{
    "code": 0,
    "msg": "operation success",
    "affectedCount": 1
}
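The response reports an affectedCount of 1 even though the request listed two IDs. One reading consistent with that result is that documentIds and filter are combined with AND, so only a Document that satisfies both is updated. The sketch below replays that logic locally on the three sample Documents. The AND semantics are an assumption inferred from the example output, and `select_for_update` is a hypothetical helper, not part of the API.

```python
# The three documents written earlier, reduced to the fields the query reads.
documents = [
    {"id": "0001", "bookName": "三国演义"},
    {"id": "0002", "bookName": "西游记"},
    {"id": "0003", "bookName": "红楼梦"},
]


def select_for_update(docs, document_ids, allowed_book_names):
    """Return IDs that satisfy both conditions (assumed AND semantics)."""
    return [
        d["id"]
        for d in docs
        if d["id"] in document_ids and d["bookName"] in allowed_book_names
    ]


if __name__ == "__main__":
    matched = select_for_update(
        documents, {"0001", "0003"}, {"三国演义", "西游记"}
    )
    print(matched, len(matched))
```

Under this reading, 0003 is excluded because its bookName is 红楼梦, and 0002 is excluded because it is not in documentIds, which leaves a single match.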
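Query the Document with ID 0001 through /document/query to confirm that the update took effect. The response below shows that the text and page fields have been updated and that the new field test_new_field has been added.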
{
    "code": 0,
    "msg": "operation success",
    "count": 1,
    "documents": [
        {
            "id": "0001",
            "author": "罗贯中",
            "test_new_field": "new field value",
            "bookName": "三国演义",
            "page": 30,
            "text": "合久必分，分久必合"
        }
    ]
}