Vector Database: Querying File Information

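Interface Definition
Query() performs an exact lookup and returns the files that fully match the query conditions. It returns information such as file length and vectorization progress and status, but not the file content. Files can be looked up in the following ways.
Specify one or more file names, optionally combined with a Filter expression on the corresponding fields of the file Metadata.
Specify a start position (offset) and a return count (limit) to look up a given range of files.
Use a Filter expression on the corresponding fields of the file Metadata to filter the files to look up.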
Query(ctx context.Context, params tcvectordb.QueryAIDocumentSetParams) (*tcvectordb.QueryAIDocumentSetResult, error)
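Usage Examples
Two examples follow: the first queries files by file name combined with a Filter, and the second queries a range of files.
The first example looks up a file by its name as stored in the vector database and filters on the scalar field author_name with a Filter expression.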
import (
    "context"
    "encoding/json"
    "fmt"
    "log"

    "github.com/tencent/vectordatabase-sdk-go/tcvectordb"
)

// The statements below belong inside a function body.
// client is an initialized tcvectordb client (creation not shown).
var (
    ctx                = context.Background()
    aiDatabase         = "go-sdk-test-ai-db"
    collectionViewName = "go-sdk-test-ai-coll"
)
col := client.AIDatabase(aiDatabase).CollectionView(collectionViewName)
// Set the query parameters.
param := tcvectordb.QueryAIDocumentSetParams{
    DocumentSetName: []string{"腾讯云向量数据库.pdf"},
    Filter:          tcvectordb.NewFilter(`author_name="sam"`),
    Limit:           3,
    Offset:          0,
    // OutputFields: []string{"indexedStatus", "textPrefix"},
}
result, err := col.Query(ctx, param)
if err != nil {
    log.Fatalf("Query failed: %v", err)
}
// Print the query result.
log.Printf("Query success: %+v", result.Count)
for _, doc := range result.Documents {
    b, err := json.Marshal(doc)
    if err != nil {
        // Skip a document that cannot be serialized instead of aborting silently.
        log.Printf("marshal failed: %v", err)
        continue
    }
    fmt.Printf("res %v\n", string(b))
}
res {"databaseName":"go-sdk-test-ai-db","collectionViewName":"go-sdk-test-ai-coll","documentSetId":"1313843877669244928","documentSetName":"腾讯云向量数据库.pdf","text":"","textPrefix":"![vdb-image](22ec6930-1629-4b18-94d9-cd5a3d27faf6.png)\\n\\n![vdb-image](67cb122b-5322-4931-905a-7c0ee2e7ce98.png)\\n\\n## 版权所有：腾讯云计算（北京）有限责任公司\\n\\n![vdb-image](50933877-8262-439f-96e6-a7e7aee2d841.png)\\n\\n\\n\\n![vdb","documentSetInfo":{"textLength":11945,"byteLength":1925502,"indexedProgress":21,"indexedStatus":"Loading","createTime":"2024-12-04 20:26:30","lastUpdateTime":"2024-12-04 20:26:59"},"ScalarFields":{"author_name":{"val":"sam"},"fileKey":{"val":1024}},"splitterPreprocess":{"appendTitleToChunk":false,"appendKeywordsToChunk":false},"parsingProcess":{"parsingType":"VisionModelParsing"}}
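The Filter in the example above is a plain expression string. When several conditions need to be combined, the string can be assembled programmatically. The following is a minimal sketch, assuming the <field_name><operator><value> form and the and keyword described under the input parameters; eqString and joinAnd are hypothetical helpers that are not part of the SDK, and the fileKey condition is only illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// eqString builds a string-equality clause such as author_name="sam".
// Hypothetical helper: string values must be wrapped in double quotes,
// and the value is assumed not to contain a double quote itself.
func eqString(field, value string) string {
	return fmt.Sprintf(`%s="%s"`, field, value)
}

// joinAnd joins already-built clauses with the "and" keyword.
func joinAnd(clauses ...string) string {
	return strings.Join(clauses, " and ")
}

func main() {
	expr := joinAnd(eqString("author_name", "sam"), "fileKey>=1024")
	fmt.Println(expr) // prints: author_name="sam" and fileKey>=1024
}
```

The resulting string can then be passed to tcvectordb.NewFilter in the same way as the literal used in the example.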
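// Query a range of files: no file name or Filter, only Limit and Offset.
// The statements below belong inside a function body.
// client is an initialized tcvectordb client (creation not shown).
var (
    ctx                = context.Background()
    aiDatabase         = "go-sdk-test-ai-db"
    collectionViewName = "go-sdk-test-ai-coll"
)
col := client.AIDatabase(aiDatabase).CollectionView(collectionViewName)
param := tcvectordb.QueryAIDocumentSetParams{
    Limit:  3,
    Offset: 0,
    // When OutputFields is set, documentSetId and documentSetName are
    // always returned as well, to make follow-up operations easier.
    OutputFields: []string{"indexedStatus", "textPrefix"},
}
result, err := col.Query(ctx, param)
if err != nil {
    log.Fatalf("Query failed: %v", err)
}
log.Printf("Query success: %+v", result.Count)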
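Input Parameters
Parameter	Required	Description
DocumentSetName	No	Names of the documents to query. Batch queries are supported; the array may hold [1,20] elements.
DocumentSetId	No	IDs of the documents to query. Batch queries are supported; the array may hold [1,20] elements.
Filter	No	Query filter expression over the fields that were given a Filter index when the CollectionView was created. An expression has the form <field_name><operator><value>, and multiple expressions can be combined with and, or, and not. For details, see Filter conditional expressions. <field_name>: the name of the field to filter on. <operator>: the operator to use. For string fields: match a single string value (=), exclude a single string value (!=), match any one of several string values (in), exclude all of several string values (not in); the value must be enclosed in ASCII double quotes. For uint64 fields: greater than (>), greater than or equal to (>=), equal to (=), less than (<), less than or equal to (<=), for example expired_time > 1623388524. For array fields: contains any of the given elements (include), excludes any of the given elements (exclude), contains all of the given elements (include all), for example name include ("Bob", "Jack"). <value>: the value to match. Example: Filter('author="jerry"').And('page>20').
Limit	Yes	Number of DocumentSets returned per page. Data type: uint64. Default: 10. Value range: [1,16384]. Note: if no query condition is configured, 10 DocumentSets are returned by default. If only a Filter expression is configured and Limit is not, 10 DocumentSets are returned by default. If only DocumentSetName or DocumentSetId is set, Limit may be omitted and 10 records are returned by default.
Offset	No	Pagination offset. It controls the start position of the results returned by a paged query, so that data can be displayed and browsed page by page. Value: an integer multiple of limit. Formula: offset = limit * (page - 1). For example, with limit = 10 and page = 2, offset = 10 * (2 - 1) = 10, so results are returned starting from the 11th record.
OutputFields	No	Fields to return, given as an array.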
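The Offset formula above, offset = limit * (page - 1), can be sketched as a small helper. This is an illustration only: pageOffset is a hypothetical function that is not part of the SDK, and the range check reuses the [1,16384] bounds stated for Limit.

```go
package main

import "fmt"

// pageOffset returns the Offset for a 1-based page number,
// following offset = limit * (page - 1).
// Hypothetical helper, not an SDK function.
func pageOffset(limit, page int64) (int64, error) {
	if limit < 1 || limit > 16384 {
		return 0, fmt.Errorf("limit %d outside [1,16384]", limit)
	}
	if page < 1 {
		return 0, fmt.Errorf("page %d must be >= 1", page)
	}
	return limit * (page - 1), nil
}

func main() {
	// Compute the Limit/Offset pair for the first three pages of size 10.
	for page := int64(1); page <= 3; page++ {
		offset, err := pageOffset(10, page)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("page %d -> Limit=10 Offset=%d\n", page, offset)
	}
}
```

The returned value would be assigned to the Offset field of QueryAIDocumentSetParams together with the same Limit for every page.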
Output Parameters
res {"databaseName":"go-sdk-test-ai-db","collectionViewName":"go-sdk-test-ai-coll","documentSetId":"1313843877669244928","documentSetName":"腾讯云向量数据库.pdf","text":"","textPrefix":"![vdb-image](22ec6930-1629-4b18-94d9-cd5a3d27faf6.png)\\n\\n![vdb-image](67cb122b-5322-4931-905a-7c0ee2e7ce98.png)\\n\\n## 版权所有：腾讯云计算（北京）有限责任公司\\n\\n![vdb-image](50933877-8262-439f-96e6-a7e7aee2d841.png)\\n\\n\\n\\n![vdb","documentSetInfo":{"textLength":11945,"byteLength":1925502,"indexedProgress":21,"indexedStatus":"Loading","createTime":"2024-12-04 20:26:30","lastUpdateTime":"2024-12-04 20:26:59"},"ScalarFields":{"author_name":{"val":"sam"},"fileKey":{"val":1024}},"splitterPreprocess":{"appendTitleToChunk":false,"appendKeywordsToChunk":false},"parsingProcess":{"parsingType":"VisionModelParsing"}}
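Parameter	Sub-parameter	Meaning
Count	-	Number of files found.
databaseName	-	Database name.
collectionViewName	-	Collection view name.
documentSetId	-	File ID.
documentSetName	-	File name.
text	-	This parameter is empty.
textPrefix	-	First 200 characters of the file content.
documentSetInfo	textLength	Number of characters in the file.
	byteLength	Number of bytes in the file.
	indexedProgress	Progress of file preprocessing and Embedding vectorization.
	indexedStatus	Status of file preprocessing and Embedding vectorization. New: waiting to be parsed. Loading: the file is being parsed. Failure: parsing or writing the file failed. Ready: parsing and writing the file completed.
	indexedErrorMsg	Description of the file parsing or writing error. Note: indexedErrorMsg is returned when indexedStatus is Failure.
	createTime	File creation time.
	lastUpdateTime	Time the file was last updated.
	keywords	File keywords.
ScalarFields	-	User-defined file Metadata fields.
splitterPreprocess	appendTitleToChunk	When the file is split, whether the Title is appended to each resulting chunk so that both are embedded together. false: do not append. true: append the section Title to the chunk.
	appendKeywordsToChunk	When the file is split, whether the keywords are appended to each resulting chunk so that both are embedded together. false: do not append. true: append the keywords of the full text to the chunk.
parsingProcess	parsingType	Parsing method for PDF files. VisionModelParsing: the file is parsed by a parsing model; recommended, as it can handle complex PDF layouts such as two-column pages and tables. AlgorithmParsing: the file is parsed by an algorithm; this is the system default. For Markdown, Word, and PPT files this parameter does not need to be set, and AlgorithmParsing is used by default.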
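The indexedStatus values listed above can be used to decide what to do with a returned document. The following is a minimal sketch that decodes one JSON-encoded document, in the shape of the sample output, and turns it into a status line. docSummary and describe are hypothetical names that are not part of the SDK; only the fields needed here are mapped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// docSummary maps only the fields used below; field names follow the sample output.
type docSummary struct {
	DocumentSetName string `json:"documentSetName"`
	DocumentSetInfo struct {
		IndexedProgress int    `json:"indexedProgress"`
		IndexedStatus   string `json:"indexedStatus"`
		IndexedErrorMsg string `json:"indexedErrorMsg"`
	} `json:"documentSetInfo"`
}

// describe turns one JSON-encoded document into a short status line.
func describe(raw []byte) (string, error) {
	var d docSummary
	if err := json.Unmarshal(raw, &d); err != nil {
		return "", err
	}
	info := d.DocumentSetInfo
	switch info.IndexedStatus {
	case "Ready":
		return d.DocumentSetName + ": ready", nil
	case "Failure":
		// indexedErrorMsg is only returned for this status.
		return fmt.Sprintf("%s: failed: %s", d.DocumentSetName, info.IndexedErrorMsg), nil
	case "New", "Loading":
		return fmt.Sprintf("%s: %s (progress %d)", d.DocumentSetName, info.IndexedStatus, info.IndexedProgress), nil
	default:
		return "", fmt.Errorf("unknown indexedStatus %q", info.IndexedStatus)
	}
}

func main() {
	raw := []byte(`{"documentSetName":"腾讯云向量数据库.pdf","documentSetInfo":{"indexedProgress":21,"indexedStatus":"Loading"}}`)
	line, err := describe(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line) // prints: 腾讯云向量数据库.pdf: Loading (progress 21)
}
```

A caller that waits for vectorization to finish could repeat the query until the status is Ready or Failure.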

