A Quick Tutorial on txtai

磐创AI

Published 2021-08-05 10:14:38


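txtai executes machine learning workflows to transform data and builds AI-powered text indexes to perform similarity searches. txtai supports indexing text snippets, documents, audio and images. Pipelines and workflows use machine learning models to transform data. The following article provides an introduction to txtai.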

https://towardsdatascience.com/introducing-txtai-an-ai-powered-search-engine-built-on-transformers-37674be252ec

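txtai has grown considerably since its first release in August 2020. Beyond building embeddings indexes, txtai now supports pipelines that transform data in preparation for indexing, workflows that join pipelines together, API bindings for JavaScript/Java/Rust/Go and the ability to scale out processing. This article covers vectorizing data, machine learning pipelines and workflows.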

Vectorizing data

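txtai originally supported building indexes over sections of text. It now also supports documents, audio and images. Documents and audio are covered in the Pipelines section below. This section shows how to vectorize images and run a similarity search.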

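sentence-transformers recently added support for the OpenAI CLIP model. The model embeds text and images into the same vector space, which makes image similarity search possible. txtai can use these models directly.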

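from txtai.embeddings import Embeddings

# images(directory) is a helper you supply that yields the images in directory to index
# query is the text to search for
embeddings = Embeddings({"method": "transformers", "path": "clip-ViT-B-32", "modelhub": False})
embeddings.index(images(directory))

# Return the single best match for query
embeddings.search(query, 1)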

The code above builds a similarity index over a directory of images and searches it with query. Try it on your own images!
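To make the idea of a shared vector space more concrete, here is a minimal sketch of what a similarity search does once images and the query are vectors. It is not txtai's implementation: the three-dimensional vectors, the file names and the `search` helper are all made up for illustration, and real CLIP embeddings have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vector, limit=1):
    """Return the `limit` best (id, score) pairs, highest score first."""
    scored = [(uid, cosine(vector, query_vector)) for uid, vector in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]

# Hypothetical vectors standing in for CLIP image embeddings
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical vector standing in for an embedded text query
query_vector = [0.8, 0.2, 0.1]

results = search(index, query_vector, 1)
print(results)
```

The query vector points in nearly the same direction as the vector for beach.jpg, so that image ranks first. An embeddings index does the same kind of ranking, typically with an approximate nearest neighbor structure rather than a full scan.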

Pipelines

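txtai's pipeline framework offers a growing number of models. A pipeline wraps a machine learning model and transforms data. Currently, pipelines can wrap Hugging Face Transformers models, Hugging Face Transformers pipelines or PyTorch models.

The following pipelines are currently implemented.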

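Questions - answer questions using a text context
Labels - apply labels to text using a zero-shot classification model; also supports similarity comparisons
Summary - summarize text
Textractor - extract text from documents
Transcription - transcribe audio to text
Translation - machine translation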

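A pipeline takes input data, applies an NLP transformation and returns the result. The notebooks linked in the sections below walk through pipeline examples.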
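The "take input, transform, return result" contract can be sketched in a few lines of plain Python. The `WordCount` class below is hypothetical and not part of txtai; it counts words instead of running a model, and it assumes a pipeline is simply a callable object that accepts one string or a list of strings.

```python
class WordCount:
    """Toy pipeline: counts whitespace-separated words instead of running a model."""

    def __call__(self, text):
        # Accept one string or a list of strings and mirror the input shape
        if isinstance(text, str):
            return self.transform(text)
        return [self.transform(item) for item in text]

    def transform(self, text):
        """Transformation applied to a single input."""
        return len(text.split())

pipeline = WordCount()

single = pipeline("txtai builds AI-powered indexes")
batch = pipeline(["one two", "three four five"])

print(single, batch)
```

Because the object is just a callable, it can be dropped anywhere a function is expected, which is what makes it easy to chain pipelines later on.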

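Summary

Summarization uses Natural Language Processing (NLP) models to build a summary of a text. This is like having a person read an article and then asking what it was about: a human would not recite the text at length. Let's look at an example.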

from txtai.pipeline import Summary

# Create the summary model
summary = Summary()

text = ("Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation "
       "of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is "
       "rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability "
       "allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models "
       "and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine "
       "that enables Natural Language Understanding (NLU) based search in any application."
)

summary(text, maxlength=10)

The code above returns:

Search is the foundation of the internet

A full example can be found in the notebook linked below.

https://colab.research.google.com/github/neuml/txtai/blob/master/examples/09_Building_abstractive_text_summaries.ipynb

Text extraction

This section shows how to extract text from documents in a way that best supports similarity search.

from txtai.pipeline import Textractor

# Create the textractor model
textractor = Textractor()

textractor("txtai/article.pdf")

The code above returns:

Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation.....

A full example can be found in the notebook linked below. It also demonstrates how to split the text to help build the text sections to index.

https://colab.research.google.com/github/neuml/txtai/blob/master/examples/10_Extract_text_from_documents.ipynb
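As a rough illustration of why splitting matters, the sketch below breaks extracted text into sentence-sized sections before indexing. It uses only a regular expression, not Textractor itself; the `sections` helper and the `minlength` threshold are hypothetical, and the (id, text, tags) tuple shape for index rows is an assumption made for this sketch.

```python
import re

def sections(text, minlength=20):
    """Split text into sentences and drop fragments shorter than minlength."""
    # Split on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if len(part) >= minlength]

text = ("Search is the base of many applications. Once data starts to pile up, "
        "users want to be able to find it. OK.")

rows = sections(text)

# Assumed (id, text, tags) rows, ready to be passed to an embeddings index
documents = [(uid, row, None) for uid, row in enumerate(rows)]

for document in documents:
    print(document)
```

Indexing shorter sections tends to give more focused matches than indexing one long blob, since each vector then represents a single idea. The short "OK." fragment is dropped because it carries too little meaning to be worth searching.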

Audio transcription

Hugging Face Transformers provides a number of models that can transcribe audio (audio to text).

from txtai.pipeline import Transcription

# Create the transcription model
transcribe = Transcription("facebook/wav2vec2-large-960h")

transcribe("Make_huge_profits.wav")

The code above returns:

Make huge profits without working make up to one hundred thousand dollars a day

A full example can be found in the notebook linked below.

https://colab.research.google.com/github/neuml/txtai/blob/master/examples/11_Transcribe_audio_to_text.ipynb

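Text translation

This section covers machine translation backed by Hugging Face Transformers models. Machine translation through cloud services has improved greatly and produces high-quality results. The example below shows how local models can give developers a reasonable alternative.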

from txtai.pipeline import Translation

# Create the translation model
translate = Translation()

translate("This is a test translation into Spanish", "es")

The code above returns:

Esta es una traducción de prueba al español

A full example can be found in the notebook linked below.

https://colab.research.google.com/github/neuml/txtai/blob/master/examples/12_Translate_text_between_languages.ipynb

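Workflows

Pipelines are great and make it easier to use a variety of machine learning models. But what if we want to glue the results of different pipelines together? For example: extract text, summarize it, translate it to English and load it into an embeddings index. That takes code that joins these operations together efficiently.

Workflows are simple yet powerful constructs that take callables and return elements. Workflows don't know they are working with pipelines, but they process pipeline data efficiently. Workflows are streaming by nature and work on data in batches, which allows large volumes of data to be processed efficiently.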
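The streaming, batched behavior described above can be sketched with plain generators. This is not txtai's Workflow class: the `batches` and `workflow` functions, the batch size and the two toy tasks are all hypothetical stand-ins chosen to show the idea.

```python
def batches(elements, size):
    """Group an iterable into lists of at most `size` elements."""
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def workflow(tasks, elements, size=2):
    """Stream elements through each task in order, one batch at a time."""
    for batch in batches(elements, size):
        for task in tasks:
            batch = task(batch)
        yield from batch

# Each task is a plain callable that takes a batch and returns a batch
tasks = [
    lambda batch: [text.strip() for text in batch],
    lambda batch: [text.upper() for text in batch],
]

# The input is an iterator, so it is never held in memory all at once
results = list(workflow(tasks, iter(["  first ", "second", " third"])))
print(results)
```

Because the workflow is a generator, nothing runs until the results are consumed, and only one batch is in flight at a time. The txtai example that follows uses the library's own task and workflow classes rather than this sketch.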

from txtai.pipeline import Transcription, Translation
from txtai.workflow import FileTask, Task, Workflow

# Create a transcription instance
transcribe = Transcription("facebook/wav2vec2-large-960h")

# Create a translation instance
translate = Translation()

tasks = [
    FileTask(transcribe, r"\.wav$"),
    Task(lambda x: translate(x, "fr"))
]

# The file:// prefix tells the workflow that an element is a file, not a text string
data = [
  "file://txtai/US_tops_5_million.wav",
  "file://txtai/Canadas_last_fully.wav",
  "file://txtai/Beijing_mobilises.wav",
  "file://txtai/The_National_Park.wav",
  "file://txtai/Maine_man_wins_1_mil.wav",
  "file://txtai/Make_huge_profits.wav"
]

# Workflow that transcribes the audio and translates the text to French
workflow = Workflow(tasks)

# Run the workflow
list(workflow(data))

The example above transcribes the audio to text and then translates the text to French.

["Les cas de virus U sont en tête d'un million",
 "La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten",
 "Bagage mobilise les embarcations d'invasion le long des côtes à mesure que les tensions tiwaniennes s'intensifient",
 "Le service des parcs nationaux met en garde contre le sacrifice d'amis plus lents dans une attaque nue",
 "L'homme principal gagne du billet de loterie",
 "Faire d'énormes profits sans travailler faire jusqu'à cent mille dollars par jour"]

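This example and others can be found in the notebook below, including a more complex workflow that summarizes text, translates it to French and then builds an embeddings index.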

https://colab.research.google.com/github/neuml/txtai/blob/master/examples/14_Run_pipeline_workflows.ipynb
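The file:// prefix and the r"\.wav$" pattern in the workflow example hint at how a task can decide which elements to act on. The sketch below is one possible way to do that, not txtai's FileTask: the `SelectTask` class and `fake_transcribe` function are hypothetical, and passing non-matching elements through unchanged is an assumption of this sketch.

```python
import re

class SelectTask:
    """Toy task: applies `action` only to prefixed elements matching `pattern`."""

    def __init__(self, action, pattern=None, prefix="file://"):
        self.action = action
        self.pattern = re.compile(pattern) if pattern else None
        self.prefix = prefix

    def __call__(self, elements):
        results = []
        for element in elements:
            if self.accept(element):
                # Strip the prefix so the action receives a plain path
                results.append(self.action(element[len(self.prefix):]))
            else:
                # Pass everything else through unchanged
                results.append(element)
        return results

    def accept(self, element):
        """True when the element has the prefix and matches the pattern."""
        if not element.startswith(self.prefix):
            return False
        return self.pattern is None or bool(self.pattern.search(element))

def fake_transcribe(path):
    """Hypothetical stand-in for a transcription model."""
    return "transcript of " + path

task = SelectTask(fake_transcribe, r"\.wav$")
output = task(["file://txtai/Make_huge_profits.wav", "file://notes.txt", "plain text"])
print(output)
```

Only the .wav file reaches the action. The other file and the plain string are left as they are, so a later task in the chain can still handle them.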

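Conclusion

All of the functionality discussed here is now available on the main branch on GitHub. txtai will keep evolving quickly, with a continued focus on adding new pipelines.

txtai aims to be simple enough to run on a laptop while still being able to scale out to cluster/cloud systems.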

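This article is part of the Tencent Cloud self-media syndication program and was shared from a WeChat official account.

Originally published: 2021-07-30. In case of infringement, contact cloudcommunity@tencent.com for removal.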
