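OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for tasks such as text matching and semantic search.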
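An embedding is a vector (list) of floating-point numbers. The distance between two vectors measures their relatedness: a small distance indicates high relatedness, and a large distance indicates low relatedness. However, OpenAI's text embedding API does not handle Chinese well. In the community's experience, a model that works better for Chinese is GanymedeNil/text2vec-large-chinese on Hugging Face; see https://huggingface.co/GanymedeNil/text2vec-large-chinese/discussions/3 for details. The author trained it on the Chinese STS-B dataset. It maps sentences to a 768-dimensional dense vector space and can be used for tasks such as sentence embedding, text matching, or semantic search.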
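To make the idea of "distance between vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity. Cosine similarity is only one common choice (cosine distance is 1 minus this value); the text above does not say which metric is used. The helper name CosineSimilarity and the three toy vectors are made up for illustration and are not output from any model.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Values close to 1 mean the vectors point in nearly the same direction.
// Note: a zero-length vector would make the result NaN.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
    {
        throw new ArgumentException("Vectors must have the same length.");
    }

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy 3-dimensional vectors, invented for this example.
// Real embeddings have hundreds of dimensions.
float[] query = { 0.1f, 0.8f, 0.3f };
float[] similar = { 0.2f, 0.7f, 0.4f };
float[] unrelated = { 0.9f, -0.1f, -0.4f };

Console.WriteLine($"query vs similar:   {CosineSimilarity(query, similar):F3}");
Console.WriteLine($"query vs unrelated: {CosineSimilarity(query, unrelated):F3}");
```

The first pair should score noticeably higher than the second, which is the "small distance means high relatedness" behavior described above.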
The Semantic Kernel samples include a hugging-face-http-server project: https://github.com/microsoft/semantic-kernel/tree/main/samples/apps/hugging-face-http-server . With this sample project, we can run Hugging Face models locally.
First, build a Docker image with the following command:
docker image build -t hf_model_server .
The latest build of the sample has problems, so I split it out into a separate repo: https://github.com/mlnethub/hugging-face-http-server
Run the container:
docker run -p 5000:5000 -d hf_model_server
The Hugging Face connector is available in the NuGet package Microsoft.SemanticKernel.Connectors.AI.HuggingFace, introduced in version 0.14: https://www.nuget.org/packages/Microsoft.SemanticKernel.Connectors.AI.HuggingFace/0.14.547.1-preview#versions-body-tab
For usage, see the unit test code in HuggingFaceEmbeddingGenerationTests:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Microsoft.SemanticKernel.Connectors.AI.HuggingFace.TextEmbedding;
    using Xunit;

    namespace SemanticKernel.Connectors.UnitTests.HuggingFace.TextEmbedding;

    /// <summary>
    /// Unit tests for <see cref="HuggingFaceTextEmbeddingGeneration"/> class.
    /// </summary>
    public class HuggingFaceEmbeddingGenerationTests : IDisposable
    {
        private const string Endpoint = "http://localhost:5000/embeddings";
        private const string Model = @"GanymedeNil/text2vec-large-chinese";

        private readonly HttpResponseMessage _response = new()
        {
            StatusCode = HttpStatusCode.OK,
        };

        /// <summary>
        /// Verifies that <see cref="HuggingFaceTextEmbeddingGeneration.GenerateEmbeddingsAsync"/>
        /// returns expected list of generated embeddings without errors.
        /// </summary>
        [Fact]
        public async Task ItReturnsEmbeddingsCorrectlyAsync()
        {
            // Arrange
            const int ExpectedEmbeddingCount = 1;
            const int ExpectedVectorCount = 8;
            List<string> data = new() { "test_string_1", "test_string_2", "test_string_3" };

            using var service = this.CreateService(HuggingFaceTestHelper.GetTestResponse("embeddings_test_response.json"));

            // Act
            var embeddings = await service.GenerateEmbeddingsAsync(data);

            // Assert
            Assert.NotNull(embeddings);
            Assert.Equal(ExpectedEmbeddingCount, embeddings.Count);
            Assert.Equal(ExpectedVectorCount, embeddings.First().Count);
        }

        /// <summary>
        /// Initializes <see cref="HuggingFaceTextEmbeddingGeneration"/> with mocked <see cref="HttpClientHandler"/>.
        /// </summary>
        /// <param name="testResponse">Test response for <see cref="HttpClientHandler"/> to return.</param>
        private HuggingFaceTextEmbeddingGeneration CreateService(string testResponse)
        {
            this._response.Content = new StringContent(testResponse);

            var httpClientHandler = HuggingFaceTestHelper.GetHttpClientHandlerMock(this._response);

            return new HuggingFaceTextEmbeddingGeneration(new Uri(Endpoint), Model, httpClientHandler);
        }

        public void Dispose()
        {
            this.Dispose(true);
            GC.SuppressFinalize(this);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (disposing)
            {
                this._response.Dispose();
            }
        }
    }
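The unit test replaces the HTTP handler with a mock that returns a canned response, so it never reaches the container. As a rough sketch of what a real call might look like, the snippet below reuses the same constructor and GenerateEmbeddingsAsync call from the test, but passes a plain HttpClientHandler. It assumes the hf_model_server container from the earlier step is listening on localhost:5000 and that the 0.14.547.1-preview package is referenced; the sample sentences are made up, and this has not been verified against other package versions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using Microsoft.SemanticKernel.Connectors.AI.HuggingFace.TextEmbedding;

// Assumption: the hf_model_server container is running and reachable here.
const string Endpoint = "http://localhost:5000/embeddings";
const string Model = "GanymedeNil/text2vec-large-chinese";

// A real handler instead of the mocked one used in the unit test.
using var handler = new HttpClientHandler();
using var service = new HuggingFaceTextEmbeddingGeneration(new Uri(Endpoint), Model, handler);

// Sample Chinese sentences, invented for illustration.
List<string> sentences = new() { "今天天气很好", "今天是个晴天" };

// Same call as in the unit test, but this time it goes over HTTP to the local server.
var embeddings = await service.GenerateEmbeddingsAsync(sentences);

Console.WriteLine($"Embeddings returned: {embeddings.Count}");
Console.WriteLine($"Vector length: {embeddings.First().Count}");
```

The first request may be slow, since the server presumably has to download and load the model before it can answer.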