[Best Practices] One-Stop NLP Semantic Aggregation with the ES Machine Learning Features

Rassyan

Last modified 2024-05-14 10:05:19

This article is included in the column: Tencent Cloud Elasticsearch Service

Introduction

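With the release of Tencent Cloud ES 8.8.1 and the later versions 8.11.3 and 8.13.3, Tencent Cloud ES has gained significantly stronger capabilities in artificial intelligence, vector search, and natural language processing (NLP). These features open up more options for developers, especially for complex NLP tasks. This article explores how to use the machine learning features of Tencent Cloud ES to implement one-stop NLP semantic aggregation, and walks through the process with a demo.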

The challenge of semantic aggregation

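Semantic aggregation means grouping the text of multiple documents by what it expresses. A simple example: "I love China" and "I enjoy digging into technology" are both positive statements, while "I hate rainy days" and "I am very angry" are both negative ones.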

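Traditional text aggregation in ES relies on values or terms that documents have in common, and texts that are worded differently almost never share the same value. Even if fielddata is enabled on a text field so that different documents yield identical terms after analysis, the result is only a surface-level clustering by vocabulary. It still does not group documents by meaning.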
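To make the difference concrete, here is a minimal sketch in plain Python, with no ES involved. It uses English renderings of the four example sentences above. Whitespace splitting stands in for an ES analyzer, and the positive/negative labels are assigned by hand purely for illustration; in the rest of the article a model produces them.

```python
from collections import Counter

# English renderings of the four example sentences.
texts = [
    "I love China",
    "I enjoy digging into technology",
    "I hate rainy days",
    "I am very angry",
]

# Hand-assigned labels (an assumption for this sketch, not model output).
labels = {
    "I love China": "positive",
    "I enjoy digging into technology": "positive",
    "I hate rainy days": "negative",
    "I am very angry": "negative",
}

# Grouping by exact value (like a keyword field): every text is its own bucket.
value_buckets = Counter(texts)

# Grouping by term (like fielddata on a text field): documents only meet on
# shared words. Whitespace tokenization is a stand-in for a real analyzer.
term_buckets = Counter(tok for t in texts for tok in set(t.lower().split()))

# Grouping by a semantic label: the buckets now reflect meaning.
label_buckets = Counter(labels[t] for t in texts)

print(value_buckets.most_common(1))
print(term_buckets.most_common(1))
print(dict(label_buckets))
```

The only term all four sentences share is "i", which says nothing about sentiment, whereas the label buckets split the texts two and two.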

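We know that converting text into a vector representation captures its semantic information, and that ES can use this information to search more precisely. But what about aggregation? The dense_vector field type used to store embeddings does not support aggregations. Unlike traditional text and numeric fields, the embedding vectors of different source texts almost never take the same value; the values of a dense vector field are "sparsely" distributed. Aggregating on them is therefore both meaningless and technically hard to implement.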

Best practice: using the ES machine learning features

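The ES machine learning features offer a solution. The official documentation page "Classify text" shows that, besides inference with embedding models, ES machine learning also supports inference with text classification models. We can use a text classification model to attach a semantic "label" to each text, which lets the traditional ES aggregations work as semantic aggregations.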

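Let's try a demo. On Hugging Face, look for a model in the Text Classification category, such as SamLowe/roberta-base-go_emotions, a text classification model that infers the type of emotion expressed in a piece of text.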

We import the model into ES with the eland tool. On a machine with public internet access, run the following command to pull the model from Hugging Face and import it into ES:

docker run -it -e HF_ENDPOINT=https://hf-mirror.com --rm docker.elastic.co/eland/eland eland_import_hub_model --url http://9.99.64.21:9200 -u elastic -p elastic_123 --hub-model-id SamLowe/roberta-base-go_emotions --start --insecure --task-type text_classification
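The import command is long, so it can help to see it assembled piece by piece. The sketch below builds the same argument list in Python using only the flags that appear in the command above. The function name `build_eland_import_command` and the `ES_PASSWORD` environment variable are hypothetical, introduced here so the password does not have to be typed inline; the sketch only prints the command and does not run it.

```python
import os
import shlex


def build_eland_import_command(es_url, username, password, hub_model_id,
                               task_type="text_classification",
                               hf_endpoint=None):
    """Assemble the docker/eland argument list used in this article."""
    cmd = ["docker", "run", "-it", "--rm"]
    if hf_endpoint:
        # Optional Hugging Face mirror, passed into the container as an env var.
        cmd += ["-e", f"HF_ENDPOINT={hf_endpoint}"]
    cmd += [
        "docker.elastic.co/eland/eland",
        "eland_import_hub_model",
        "--url", es_url,                 # ES endpoint
        "-u", username,
        "-p", password,
        "--hub-model-id", hub_model_id,  # model name on Hugging Face
        "--task-type", task_type,
        "--start",                       # start the deployment after import
        "--insecure",
    ]
    return cmd


command = build_eland_import_command(
    es_url="http://9.99.64.21:9200",
    username="elastic",
    # ES_PASSWORD is a hypothetical variable name; the fallback is a placeholder.
    password=os.environ.get("ES_PASSWORD", "<your-password>"),
    hub_model_id="SamLowe/roberta-base-go_emotions",
    hf_endpoint="https://hf-mirror.com",
)
print(shlex.join(command))
```

To actually run the import, the list could be passed to `subprocess.run(command, check=True)` on a machine where Docker is available.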

Once the model is imported, it appears on the model management page in Kibana, where we simply click Deploy to deploy it.
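Before wiring the model into a pipeline, it may be worth checking that the deployment answers inference requests. The following is a sketch, assuming the 8.x Python client and its `ml.infer_trained_model` method. The helper name `classify_text` is hypothetical, the model ID `samlowe__roberta-base-go_emotions` is an assumption about how eland names the imported model (confirm the actual ID in Kibana), and `text_field` is assumed to be the model's input field.

```python
def classify_text(es, model_id, text, input_field="text_field"):
    """Send one text to a deployed text classification model.

    `es` is an elasticsearch.Elasticsearch client (8.x). Returns the
    predicted label and its probability from the first inference result.
    """
    response = es.ml.infer_trained_model(
        model_id=model_id,
        docs=[{input_field: text}],
    )
    result = response["inference_results"][0]
    return result["predicted_value"], result["prediction_probability"]


# Example call against a real cluster (not executed here):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://9.99.64.21:9200",
#                      basic_auth=("elastic", "<your-password>"))
#   label, prob = classify_text(es, "samlowe__roberta-base-go_emotions",
#                               "I hate rainy days")
#   print(label, prob)
```

If the call returns a label and a probability, the model is deployed and ready to be referenced from an ingest pipeline.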

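Likewise, on the Ingest Pipelines management page in Kibana, we define an inference pipeline for text classification; filling in the form as shown in the figure is all it takes. The pipeline applies the model automatically when data is written and adds a semantic label to each text.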
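For readers who prefer the API to the Kibana form, a pipeline of this kind can be sketched as a request body with a single inference processor. This is not the exact configuration from the figure, which is not reproduced here. The model ID `samlowe__roberta-base-go_emotions`, the target field `ml.inference`, and the helper name `build_inference_pipeline` are assumptions for illustration; `field_map` maps the document's `text` field to `text_field`, which is assumed to be the model's input field.

```python
import json


def build_inference_pipeline(model_id, source_field="text",
                             target_field="ml.inference"):
    """Return an ingest pipeline body with one text classification processor."""
    return {
        "description": "Label incoming text with a text classification model",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    # Where the prediction is written in each document.
                    "target_field": target_field,
                    # Document field -> field name the model expects.
                    "field_map": {source_field: "text_field"},
                    "inference_config": {"text_classification": {}},
                }
            }
        ],
    }


# Assumed model ID; check the real one on the Kibana model management page.
pipeline = build_inference_pipeline("samlowe__roberta-base-go_emotions")
print(json.dumps(pipeline, indent=2))
```

With the 8.x Python client, the body could then be registered with `es.ingest.put_pipeline(id="text_classification_inference_pipeline", description=pipeline["description"], processors=pipeline["processors"])`. Under these assumptions the predicted label would land in `ml.inference.predicted_value`.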

Create an index for the demo.

PUT text_classification_demo
{
  "mappings": {
    "dynamic_templates": [
      {
        "message": {
          "match": "text",
          "mapping": {
            "fields": {
              "keyword": {
                "ignore_above": 2048,
                "type": "keyword"
              }
            },
            "type": "text"
          }
        }
      },
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  },
  "settings": {
    "index": {
      "refresh_interval": "1s",
      "number_of_shards": "2"
    }
  }
}

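Using an open-source dataset from Hugging Face, we write a short Python script that bulk-writes the text into the ES index with the inference pipeline specified. Install the Python dependencies with pip install datasets elasticsearch. The demo code is as follows.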

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import BulkIndexError
from datasets import load_dataset

dataset = load_dataset("go_emotions", "simplified", split='train')

es_username = 'elastic'
es_password = 'elastic_123'  # change to your ES password
es_host = '9.99.64.21'  # change to your ES host
es_port = 9200

es = Elasticsearch(
    hosts=[{'host': es_host, 'port': es_port, 'scheme': 'http'}],
    basic_auth=(es_username, es_password),
)


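def prepare_data(dataset):
    # Yield one bulk action per row for the first 40,000 rows of the dataset.
    for doc in dataset.select(range(40000)):
        yield {
            '_index': 'text_classification_demo',
            '_id': doc['id'],  # reuse the dataset's own ID as the document ID
            '_source': json.dumps(doc)
        }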


def bulk_insert(data, chunk_size=100):
    # Send the actions in chunks through the inference pipeline. The long
    # timeout leaves room for model inference on every document in a chunk.
    try:
        success, _ = bulk(es, data, chunk_size=chunk_size, stats_only=True,
                          pipeline="text_classification_inference_pipeline",
                          request_timeout=600)
        print(f"Successfully indexed {success} documents.")
    except BulkIndexError as e:
        print(f"{len(e.errors)} document(s) failed to index.")
        for error in e.errors:
            print("Error details:", error)


bulk_insert(prepare_data(dataset))

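Running the script pulls 40,000 texts from the train split of the Hugging Face dataset and writes them to the text_classification_demo index in ES through _bulk, with the ingest pipeline set to the text_classification_inference_pipeline we just defined.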