
A Little spaCy Food for Thought: An Easy-to-Use NLP Framework

Author: 代码医生工作室
Published: 2019-06-22 14:34:59
Included in the column: 相约机器人

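This article shows how to get started with spaCy quickly and easily. It is aimed at beginners and enthusiasts in the NLP field, and provides step-by-step instructions and clear examples.

spaCy is an NLP framework developed by Explosion AI and first released in February 2015. It is often described as one of the fastest NLP libraries available. Its other advantages are ease of use and support for neural network models.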

Step 1: Install spaCy

Open a terminal (command prompt) and run:

pip install spacy

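Step 2: Download a language model

Run the following command:

python -m spacy download en_core_web_lg

This model (en_core_web_lg) is spaCy's largest English model, at 788 MB. Smaller English models are also available, and spaCy provides models for a number of languages (English, German, French, Spanish, Portuguese, Italian, Dutch, Greek).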

Step 3: Import the library and load the model

After entering the following lines in your Python editor, you are ready for some NLP fun:

import spacy

nlp = spacy.load('en_core_web_lg')
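
If loading fails, the usual cause is that the model from Step 2 was never downloaded. Models fetched with the download command are installed as ordinary Python packages, so one way to check beforehand is to ask the import system whether the package exists. The following is a minimal sketch using only the standard library; the helper name `model_is_installed` is an assumption made for illustration and is not part of spaCy.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# "en_core_web_lg" is the model name used in this article.
if model_is_installed("en_core_web_lg"):
    print("Model found; spacy.load('en_core_web_lg') should work.")
else:
    print("Model missing; run: python -m spacy download en_core_web_lg")
```

The check only confirms that the package is present; it does not verify that the model version is compatible with the installed spaCy version.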

Step 4: Create a sample text

sample_text = (
    "Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook. "
    "He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake. "
    "He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company. "
    "Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week. "
    "The CEO was startled by the question, and he took about 7 seconds to respond with no."
)

# Run the full pipeline (tokenizer, tagger, parser, entity recognizer) on the text.
doc = nlp(sample_text)

Step 5: Split the paragraph into sentences

Split the text into sentences and print the character length of each sentence after it:

sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text)
    print("Number of characters:", len(sentence.text))
    print("-" * 35)

Output:

Mark Zuckerberg took two days to testify before members of Congress last week, and he apologised for privacy breaches on Facebook.
Number of characters: 130
-----------------------------------
He said that the social media website did not take a broad enough view of its responsibility, which was a big mistake.
Number of characters: 118
-----------------------------------
He continued to take responsibility for Facebook, saying that he started it, runs it, and he is responsible for what happens at the company.
Number of characters: 140
-----------------------------------
Illinois Senator Dick Durbin asked Zuckerberg whether he would be comfortable sharing the name of the hotel where he stayed the previous night, or the names of the people who he messaged that week.
Number of characters: 197
-----------------------------------
The CEO was startled by the question, and he took about 7 seconds to respond with no.
Number of characters: 85
-----------------------------------
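
To make clear what is being counted, here is a rough standard-library stand-in for the same loop. It is not how spaCy segments sentences (spaCy relies on its statistical pipeline); it simply splits after `.`, `!` or `?` followed by whitespace, so it will break on abbreviations such as "Dr." and is only a sketch. The function name `naive_sentences` and the short sample string are assumptions for illustration.

```python
import re


def naive_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "He started it. He runs it."
for sentence in naive_sentences(sample):
    print(sentence)
    print("Number of characters:", len(sentence))
    print("-" * 35)
```

The character count is just the length of the sentence string, including spaces and the closing punctuation.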

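Step 6: Entity recognition

Entity recognition performance is an important evaluation criterion for NLP models. spaCy does it with a single line of code, and does it well:

from spacy import displacy

# Highlight named entities inline; jupyter=True renders inside a notebook.
displacy.render(doc, style='ent', jupyter=True)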

Output: (an image in the original article: the sample text with its named entities highlighted)
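
Conceptually, the entity view marks labelled character spans inside the text. In spaCy these spans come from `doc.ents`, where each entity exposes `start_char`, `end_char` and `label_`. The sketch below imitates the idea in plain text without spaCy; the function name `annotate`, the example sentence, and the hand-written spans and labels are all assumptions for illustration, not output from the model.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) character span in [text | LABEL] markers."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # untouched text before the span
        pieces.append(f"[{text[start:end]} | {label}]")  # the marked entity
        cursor = end
    pieces.append(text[cursor:])                         # remaining tail
    return "".join(pieces)


# Hand-written spans (hypothetical), given as character offsets into the text.
text = "Mark Zuckerberg spoke to Congress"
spans = [(0, 15, "PERSON"), (25, 33, "ORG")]
print(annotate(text, spans))
```

The sketch assumes the spans do not overlap, which is also true of the entities spaCy returns in `doc.ents`.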

Step 7: Tokenization and part-of-speech tagging

Tokenize the text and look at some attributes of each token:

for token in doc:
    # One tab-separated row per token.
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Output:

Mark     0   mark     False  False  Xxxx   PROPN  NNP
Zucker.  5   zucker.  False  False  Xxxxx  PROPN  NNP
took     16  take     False  False  xxxx   VERB   VBD
two      21  two      False  False  xxx    NUM    CD
days     25  day      False  False  xxxx   NOUN   NNS
to       30  to       False  False  xx     PART   TO
testify  33  testify  False  False  xxxx   VERB   VB
before   41  before   False  False  xxxx   ADP    IN
members  48  member   False  False  xxxx   NOUN   NNS
of       56  of       False  False  xx     ADP    IN

Again, it is easy to apply and immediately gives satisfying results. A brief explanation of the printed attributes:

text: the token itself
idx: character offset of the token's start within the document
lemma_: base form (lemma) of the word
is_punct: whether the token is a punctuation symbol
is_space: whether the token is whitespace
shape_: shape of the token, showing which letters are capitalized
pos_: the coarse-grained part-of-speech tag
tag_: the fine-grained part-of-speech tag
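
Two of these attributes, `idx` and `shape_`, are easy to reproduce by hand, which helps in reading the output. The sketch below is an approximation written for illustration, not spaCy's code: it tokenizes on whitespace only (spaCy's tokenizer also splits off punctuation), and it builds the shape by mapping uppercase letters to `X`, lowercase letters to `x` and digits to `d`, keeping at most four repeats of the same symbol. The function names are assumptions.

```python
def token_offsets(text):
    """Return (word, character offset) pairs for whitespace-separated words."""
    offsets = []
    position = 0
    for word in text.split():
        start = text.index(word, position)  # search from the end of the previous word
        offsets.append((word, start))
        position = start + len(word)
    return offsets


def word_shape(word):
    """Approximate a token shape: X/x/d per character, runs capped at four."""
    shape = []
    last = ""
    run = 0
    for ch in word:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other symbols are kept as-is
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


for word, start in token_offsets("Mark Zuckerberg took two days"):
    print(word, start, word_shape(word))
```

This explains why a ten-letter name such as "Zuckerberg" has the shape `Xxxxx`: one capital followed by a run of lowercase letters that is cut off after four.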

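What is part-of-speech tagging?

It is the process of assigning a label such as noun, verb, or adjective to each token after the whole text has been split into tokens.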
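
The Step 7 output shows two tag columns: the coarse tag (`pos_`) and the fine-grained tag (`tag_`). As a small illustration of how they relate, the lookup table below contains only the pairs that appear in those ten rows. It is not spaCy's full mapping, which covers many more tags and depends on the model; the names `FINE_TO_COARSE` and `coarse_tag` are assumptions.

```python
# Fine-grained tag -> coarse tag, limited to the pairs seen in the Step 7 output.
FINE_TO_COARSE = {
    "NNP": "PROPN",  # proper noun, singular
    "VBD": "VERB",   # verb, past tense
    "VB": "VERB",    # verb, base form
    "CD": "NUM",     # cardinal number
    "NNS": "NOUN",   # noun, plural
    "TO": "PART",    # infinitival "to"
    "IN": "ADP",     # preposition
}


def coarse_tag(fine_tag):
    """Look up the coarse tag, or return None for tags not in the table."""
    return FINE_TO_COARSE.get(fine_tag)


for fine in ["NNP", "VBD", "NNS", "IN"]:
    print(fine, "->", coarse_tag(fine))
```

Several fine-grained tags share one coarse tag (both `VBD` and `VB` are verbs), which is why the coarse column is useful for quick filtering.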

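Step 8: Numbers only

When working with language and text, where do numbers come in?

Machines need to convert everything into numbers to make sense of the world, so in NLP each word is represented by an array of numbers (a word vector). Here is the word vector for "man" in spaCy's vocabulary:

[-1.7310e-01,  2.0663e-01,  1.6543e-02, ....., -7.3803e-02]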

The word vectors in this spaCy model have length 300. Other frameworks may use a different length.

Once word vectors are available, you can observe that words used in similar contexts are also mathematically similar. Here are some examples:

from scipy import spatial


def cosine_similarity(x, y):
    # SciPy returns the cosine distance, so subtract it from 1 to get similarity.
    return 1 - spatial.distance.cosine(x, y)


# Note: 'tomatos' is the spelling used in the original; the outputs correspond to it.
print("apple vs banana: ", cosine_similarity(nlp.vocab['apple'].vector, nlp.vocab['banana'].vector))
print("car vs banana: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['banana'].vector))
print("car vs bus: ", cosine_similarity(nlp.vocab['car'].vector, nlp.vocab['bus'].vector))
print("tomatos vs banana: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['banana'].vector))
print("tomatos vs cucumber: ", cosine_similarity(nlp.vocab['tomatos'].vector, nlp.vocab['cucumber'].vector))

Output:

apple vs banana:  0.5831844210624695
car vs banana:  0.16172660887241364
car vs bus:  0.48169606924057007
tomatos vs banana:  0.38079631328582764
tomatos vs cucumber:  0.5478045344352722

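Impressive, isn't it? Similarity is higher when two fruits or vegetables, or two vehicles, are compared. When two unrelated objects such as a car and a banana are compared, the similarity is quite low. The similarity between tomatoes and bananas is higher than that between car and banana, but lower than tomatoes vs cucumber and apple vs banana, which matches intuition.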
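
The quantity being compared is the cosine of the angle between two vectors: the dot product divided by the product of the vector lengths. SciPy's `spatial.distance.cosine` returns one minus this value, which is why the code subtracts it from 1. The sketch below computes the same thing with the standard library only; the short vectors are made-up numbers for illustration, not real word vectors.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up 3-dimensional vectors (hypothetical values).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
perpendicular = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print("same direction:", same_direction)
print("perpendicular:", perpendicular)
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0, so the measure reflects direction rather than magnitude.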

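Step 9: King = Queen + (Man - Woman)?

If everything is represented by numbers, and similarity can be computed mathematically, can we do other calculations too? For example, if we subtract "woman" from "man" and add the difference to "queen", do we get "king"? Let's try:

from scipy import spatial


def cosine_similarity(x, y):
    # SciPy returns the cosine distance, so subtract it from 1 to get similarity.
    return 1 - spatial.distance.cosine(x, y)


man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector

# Add the "man minus woman" difference to "queen".
calculated_king = man - woman + queen
print("similarity between our calculated king vector and real king vector:", cosine_similarity(calculated_king, king))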

Output:

similarity between our calculated king vector and real king vector: 0.771614134311676

You can try other word combinations and observe similarly promising results.
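
To see why this arithmetic can work at all, it helps to look at a toy case. The sketch below uses made-up two-dimensional vectors in which the first coordinate stands for "royalty" and the second for "gender"; these values are assumptions chosen so that the analogy holds exactly, whereas real 300-dimensional vectors only come close (0.77 above). All names in the block are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Toy vectors (hypothetical): [royalty, gender]
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
}

# man - woman + queen, computed coordinate by coordinate.
calculated_king = [
    m - w + q
    for m, w, q in zip(toy_vectors["man"], toy_vectors["woman"], toy_vectors["queen"])
]

# Find the word whose vector is most similar to the calculated one.
nearest = max(toy_vectors, key=lambda word: cosine_similarity(toy_vectors[word], calculated_king))
print("calculated vector:", calculated_king)
print("nearest word:", nearest)
```

Subtracting "woman" from "man" isolates the gender direction, and adding it to "queen" flips only that coordinate while leaving the royalty coordinate untouched.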

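Conclusion

The aim of this article was to give a short, simple introduction to the spaCy framework and to show a few easy NLP examples. Hopefully it was helpful. More details and many more examples can be found on the spaCy website, which is well designed and informative.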

Originally published 2019-05-17; shared from the WeChat official account 相约机器人.
