
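Text Processing Tool - TextBlob

Introduction to TextBlob

TextBlob is an open-source text processing library written in Python. It can be used for many natural language processing tasks, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and text translation. All of TextBlob's features are described in the official documentation.

Main Features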

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Quickstart

Create a TextBlob

First, import the TextBlob class.

>>> from textblob import TextBlob

Let’s create our first TextBlob.

>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")

Part-of-speech Tagging

Part-of-speech tags can be accessed through the tags property.

>>> wiki.tags
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
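The tag list is an ordinary list of (word, tag) tuples, so it can be filtered with plain Python. The sketch below copies the output shown above into a literal, so it runs without TextBlob installed, and keeps the words whose tag starts with a given prefix. The helper name words_with_tag_prefix is hypothetical and not part of TextBlob.

```python
# Tags copied verbatim from the wiki.tags output above.
tags = [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
        ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose part-of-speech tag starts with `prefix`."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# 'NN' matches NN and NNP (nouns); 'JJ' matches adjectives.
nouns = words_with_tag_prefix(tags, 'NN')
adjectives = words_with_tag_prefix(tags, 'JJ')
print(nouns)
print(adjectives)
```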

Noun Phrase Extraction

Similarly, noun phrases are accessed through the noun_phrases property. Note that only noun phrases are extracted.

>>> wiki.noun_phrases
WordList(['python'])

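Sentiment Analysis

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity).

The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive.

The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.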

>>> testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
>>> testimonial.sentiment
Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
>>> testimonial.sentiment.polarity
0.39166666666666666
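A polarity score is often turned into a coarse label. The following is a minimal sketch of one way to do that. The function label_polarity and its 0.1 threshold are assumptions chosen for illustration, not something TextBlob provides or recommends.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    `threshold` is an arbitrary, hypothetical cut-off: scores whose
    absolute value does not exceed it are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# The polarity reported for the testimonial above.
label = label_polarity(0.39166666666666666)
print(label)
```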

Tokenization

You can break TextBlobs into words or sentences.

>>> zen = TextBlob("Beautiful is better than ugly. "
...                "Explicit is better than implicit. "
...                "Simple is better than complex.")
>>> zen.words
WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
>>> zen.sentences
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Sentence objects have the same properties and methods as TextBlobs.

>>> for sentence in zen.sentences:
...     print(sentence.sentiment)

Words Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str; unicode in Python 2) with useful methods, e.g. for word inflection.

singularize() returns the singular form and pluralize() the plural form. Both are meant for nouns and take irregular plurals into account.
>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'

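Words can be lemmatized with the lemmatize() method of the Word class. Nouns are reduced to their singular form and verbs to their base form, so the part of speech has to be passed explicitly when the word is a verb.

>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()     # treated as a noun by default
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v")  # pass in the part of speech (verb)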
'go'

WordNet Integration

You can access the synsets for a Word via the synsets property or the get_synsets method, optionally passing in a part of speech.

>>> from textblob import Word
>>> from textblob.wordnet import VERB
>>> word = Word("octopus")
>>> word.synsets
[Synset('octopus.n.01'), Synset('octopus.n.02')]
>>> Word("hack").get_synsets(pos=VERB)    # only synsets where the word is a verb; with no argument the result matches the synsets property
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]

You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech argument.

>>> Word("octopus").definitions
['tentacles of octopus prepared as food', 'bottom-living cephalopod having a soft oval body with eight long tentacles']

You can also create synsets directly.

>>> from textblob.wordnet import Synset
>>> octopus = Synset('octopus.n.02')
>>> shrimp = Synset('shrimp.n.03')
>>> octopus.path_similarity(shrimp)
0.1111111111111111
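In NLTK's WordNet interface, path_similarity is defined as 1 / (d + 1), where d is the length of the shortest path connecting the two synsets in the hypernym/hyponym taxonomy. The sketch below only illustrates that arithmetic. Both helper functions are hypothetical and do not query WordNet.

```python
def path_similarity_from_distance(distance):
    """Similarity for a shortest path of `distance` edges: 1 / (d + 1)."""
    return 1.0 / (distance + 1)


def distance_from_path_similarity(similarity):
    """Invert the formula to recover the number of edges on the path."""
    return round(1.0 / similarity) - 1


# The score reported above for octopus.n.02 and shrimp.n.03.
score = 0.1111111111111111
hops = distance_from_path_similarity(score)
print(hops)
print(path_similarity_from_distance(hops))
```

Under this formula, identical synsets (d = 0) score 1.0, and the score shrinks toward 0 as the path grows longer.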

For more information on the WordNet API, see the NLTK documentation on the Wordnet Interface.

WordLists

A WordList is just a Python list with additional methods. The words property of a TextBlob returns one, holding the tokenized words of the text.

>>> animals = TextBlob("cat dog octopus")
>>> animals.words
WordList(['cat', 'dog', 'octopus'])
>>> animals.words.pluralize()
WordList(['cats', 'dogs', 'octopodes'])

Spelling Correction

Use the correct() method to attempt spelling correction.

>>> b = TextBlob("I havv goood speling!")
>>> print(b.correct())
I have good spelling!

Word objects have a Word.spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.

>>> from textblob import Word
>>> w = Word('falibility')
>>> w.spellcheck()
[('fallibility', 1.0)]

Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector”[1] as implemented in the pattern library. It is about 70% accurate [2].
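The core idea of Norvig's corrector is to generate every string within one edit of the input and keep those that are known words. The sketch below shows only that candidate step, with a tiny hypothetical vocabulary. The full approach also ranks candidates by corpus frequency and falls back to two edits, and this is not the code used by pattern or TextBlob.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocabulary):
    """Known words at edit distance 0 or 1 from `word`."""
    if word in vocabulary:
        return {word}
    return edits1(word) & vocabulary


# Hypothetical vocabulary; a real corrector derives one from a large corpus.
vocabulary = {"have", "good", "spelling", "fallibility"}
suggestions = {w: candidates(w, vocabulary)
               for w in ["havv", "goood", "speling", "falibility"]}
print(suggestions)
```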

Get Word and Noun Phrase Frequencies

There are two ways to get the frequency of a word or noun phrase in a TextBlob.

The first is through the word_counts dictionary.

>>> monty = TextBlob("We are no longer the Knights who say Ni. "
...                     "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.word_counts['ekki']
3

If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a frequency of 0.

The second way is to use the count() method.

>>> monty.words.count('ekki')
3

You can specify whether or not the search should be case-sensitive (default is False).

>>> monty.words.count('ekki', case_sensitive=True)   # case-sensitive search
2

Each of these methods can also be used with noun phrases.

>>> wiki.noun_phrases.count('python')
1
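The counting behavior described in this section (case-insensitive by default, 0 for unknown words, optional case-sensitive matching) can be mimicked with the standard library. This is a rough sketch under the assumption that words are runs of \w characters. It is not TextBlob's tokenizer, and the function count_word is a hypothetical name.

```python
import re
from collections import Counter

text = ("We are no longer the Knights who say Ni. "
        "We are now the Knights who say Ekki ekki ekki PTANG.")

# Naive tokenization: keep runs of word characters, dropping punctuation.
tokens = re.findall(r"\w+", text)

# Lower-casing the keys makes lookups case-insensitive; a Counter
# returns 0 for keys it has never seen.
word_counts = Counter(token.lower() for token in tokens)


def count_word(word, case_sensitive=False):
    """Count occurrences of `word`, ignoring case unless asked not to."""
    if case_sensitive:
        return sum(1 for token in tokens if token == word)
    return word_counts[word.lower()]


print(word_counts["ekki"])
print(count_word("ekki", case_sensitive=True))
print(word_counts["shrubbery"])
```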

Translation and Language Detection

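New in version 0.5.0. Both methods were deprecated in version 0.16.0 and removed in version 0.18.0, so the examples in this section apply to older releases.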

TextBlobs can be translated between languages.

>>> en_blob = TextBlob(u'Simple is better than complex.')
>>> en_blob.translate(to='es')
TextBlob("Simple es mejor que complejo.")

If no source language is specified, TextBlob will attempt to detect the language. A TranslatorError is raised if the TextBlob cannot be translated into the requested language, and NotTranslated is raised if the translated result is the same as the input string. You can also specify the source language explicitly, like so.

>>> chinese_blob = TextBlob(u"美丽优于丑陋")
>>> chinese_blob.translate(from_lang="zh-CN", to='en')
TextBlob("Beautiful is better than ugly")

You can also attempt to detect a TextBlob’s language using TextBlob.detect_language().

>>> b = TextBlob(u"بسيط هو أفضل من مجمع")
>>> b.detect_language()
'ar'

As a reference, language codes can be found here.

Language translation and detection is powered by the Google Translate API.

Parsing

Use the parse() method to parse the text.

>>> b = TextBlob("And now for something completely different.")
>>> print(b.parse())
And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O
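Each token in the parse output is a slash-separated record. Assuming the four fields are word, part-of-speech tag, chunk tag, and prepositional noun phrase (PNP) tag, which is how pattern's default output is usually read, the string can be split into named tuples. The names Token and split_parse are hypothetical helpers, not TextBlob API.

```python
from collections import namedtuple

# Field names are an assumption about the WORD/TAG/CHUNK/PNP layout.
Token = namedtuple("Token", ["word", "pos", "chunk", "pnp"])


def split_parse(parsed):
    """Split a parse string into Token records.

    rsplit with maxsplit=3 keeps any '/' inside the word itself intact.
    """
    return [Token(*item.rsplit("/", 3)) for item in parsed.split()]


# The parse output shown above, copied as a literal.
parsed = ("And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP "
          "something/NN/B-NP/I-PNP completely/RB/B-ADJP/O "
          "different/JJ/I-ADJP/O ././O/O")

tokens = split_parse(parsed)
noun_chunk_words = [t.word for t in tokens if t.chunk in ("B-NP", "I-NP")]
print(tokens[0])
print(noun_chunk_words)
```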

By default, TextBlob uses pattern’s parser [3].

TextBlobs Are Like Python Strings!

You can use Python’s substring syntax.

>>> zen[0:19]
TextBlob("Beautiful is better")

You can use common string methods.

>>> zen.upper()
TextBlob("BEAUTIFUL IS BETTER THAN UGLY. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX.")
>>> zen.find("Simple")
65

You can make comparisons between TextBlobs and strings.

>>> apple_blob = TextBlob('apples')
>>> banana_blob = TextBlob('bananas')
>>> apple_blob < banana_blob
True
>>> apple_blob == 'apples'
True

You can concatenate and interpolate TextBlobs and strings.

>>> apple_blob + ' and ' + banana_blob
TextBlob("apples and bananas")
>>> "{0} and {1}".format(apple_blob, banana_blob)
'apples and bananas'

n-grams

The TextBlob.ngrams() method returns a list of tuples of n successive words. Each element is returned as a WordList.

>>> blob = TextBlob("Now is better than never.")
>>> blob.ngrams(n=3)
[WordList(['Now', 'is', 'better']), WordList(['is', 'better', 'than']), WordList(['better', 'than', 'never'])]
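An n-gram list is a sliding window of width n over the word sequence. The following pure-Python sketch reproduces the output above with plain lists instead of WordList objects. It illustrates the idea and is not TextBlob's implementation.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a list."""
    if n <= 0:
        return []
    # A sequence of k words contains k - n + 1 windows of width n.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = "Now is better than never".split()
trigrams = ngrams(words, n=3)
print(trigrams)
```

When n is larger than the number of words, the range is empty and the function returns an empty list.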

Get Start and End Indices of Sentences

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

>>> for s in zen.sentences:
...     print(s)
...     print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))
Beautiful is better than ugly.
---- Starts at index 0, Ends at index 30
Explicit is better than implicit.
---- Starts at index 31, Ends at index 64
Simple is better than complex.
---- Starts at index 65, Ends at index 95
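The indices behave like Python slice bounds: the start is inclusive and the end is exclusive, so slicing the original text with them recovers each sentence. The sketch below reproduces the same offsets with a naive regular expression. The function sentence_spans is a hypothetical stand-in for TextBlob's sentence tokenizer and only handles simple punctuation.

```python
import re

text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")


def sentence_spans(text):
    """Return (start, end) offsets of sentences ending in '.', '!' or '?'."""
    pattern = r"\S[^.!?]*[.!?]"
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


spans = sentence_spans(text)
# Slicing with the offsets gives back the sentences themselves.
sentences = [text[start:end] for start, end in spans]
for (start, end), sentence in zip(spans, sentences):
    print(start, end, sentence)
```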

Documentation

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.
