[Kaggle Micro-Course] Natural Language Processing - 1. Intro to NLP

Michael阿明
Published 2021-02-19 10:55:26
Part of the column: Michael阿明学习之路

Learned from https://www.kaggle.com/learn/natural-language-processing

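1. NLP with the spaCy library

spaCy: https://spacy.io/usage

  • spaCy needs to know which language it is working with; load a language model with spacy.load()

Open cmd as administrator and run python -m spacy download en to download the English model en. (The en shortcut is spaCy 2.x syntax. spaCy 3.x removed shortcut names, so there the model is downloaded and loaded under its full name, for example en_core_web_sm.)

import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; in spaCy 3.x use spacy.load('en_core_web_sm')

  • You can then process text

doc = nlp("Tea is healthy and calming, don't you think?")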

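2. Tokenizing

Tokenizing returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or a punctuation mark.

spaCy splits contractions like "don't" into two tokens: "do" and "n't". You can see the tokens by iterating over the document.

for token in doc:
    print(token)

Output:
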
Tea
is
healthy
and
calming
,
do
n't
you
think
?
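
Each token also has a position in the document, and slicing the document by token positions gives back a span of text; the matching section later relies on both. A minimal sketch, assuming spaCy is installed. It uses spacy.blank("en") instead of a downloaded model, on the assumption that tokenization alone needs no trained pipeline:

```python
import spacy

# A blank English pipeline contains only the tokenizer (no model download needed).
nlp = spacy.blank("en")
doc = nlp("Tea is healthy and calming, don't you think?")

# token.i is the token's index within the doc; token.text is its string form.
tokens = [(token.i, token.text) for token in doc]
for i, text in tokens:
    print(i, text)

# Slicing a Doc returns a Span covering the tokens from start up to (not including) end.
contraction = doc[6:8]
print(contraction.text)
```

The slice doc[6:8] covers the two tokens "do" and "n't", so its text is the original contraction.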

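3. Text preprocessing

There are several kinds of preprocessing that can improve how we model with words.

  • The first is "lemmatizing". The "lemma" of a word is its base form. For example, "walk" is the lemma of "walking", so lemmatizing the word "walking" converts it to "walk".
  • Removing stopwords is also common. Stopwords are words that occur frequently in a language but carry little information. English stopwords include "the", "is", "and", "but", "not".

token.lemma_ returns the lemma of a word. token.is_stop returns the boolean True if the token is a stopword (and False otherwise).

print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

In the sentence above, the important words are tea, healthy, and calming. Removing stopwords may help a predictive model focus on the relevant words.

Lemmatizing likewise helps by merging multiple forms of the same word into one base form ("calming", "calms", "calmed" all become "calm").

However, lemmatizing and removing stopwords can also make a model perform worse, so you should treat this preprocessing as part of your hyperparameter optimization process.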
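
As a rough illustration of stopword removal, the sketch below filters the example sentence down to its content words. It assumes spaCy is installed and uses spacy.blank("en"), which should be enough for the is_stop and is_punct flags. Lemmas are left out because token.lemma_ needs a pipeline that includes a lemmatizer, such as the downloaded English model:

```python
import spacy

# Blank English pipeline: tokenizer plus lexical flags such as is_stop and is_punct.
nlp = spacy.blank("en")
doc = nlp("Tea is healthy and calming, don't you think?")

# Keep tokens that are neither stopwords nor punctuation, lowercased.
kept = [token.lower_ for token in doc if not token.is_stop and not token.is_punct]
print(kept)
```

Whether to feed the filtered list or the full token list to a model is exactly the kind of choice that can be tuned like any other hyperparameter.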

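4. Pattern matching

Another common NLP task is matching words or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a Matcher. When you want to match a list of terms, PhraseMatcher is easier and more efficient.

For example, to find where different smartphone models show up in some text, you can create patterns for the model names of interest.

  • First, create the PhraseMatcher

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Above, the matcher uses the vocabulary of the English model we already loaded, and attr='LOWER' makes it compare lowercased text, so matching is case-insensitive.

  • Create the list of terms to match

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
print(patterns)
# Output: [Galaxy Note, iPhone 11, iPhone XS, Google Pixel]
matcher.add("match1", patterns)  # spaCy 3.x signature; older 2.x releases used matcher.add("match1", None, *patterns)
# help(matcher.add)
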
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
for i, text in enumerate(text_doc):
    print(i, text)
matches = matcher(text_doc)
print(matches)

Output:

0 Glowing
1 review
2 overall
3 ,
4 and
5 some
6 really
7 interesting
8 side
9 -
10 by
11 -
12 side
13 photography
14 tests
15 pitting
16 the
17 iPhone
18 11
19 Pro
20 against
21 the
22 Galaxy
23 Note
24 10
25 Plus
26 and
27 last
28 year
29 ’s
30 iPhone
31 XS
32 and
33 Google
34 Pixel
35 3
36 .
[(12981744483764759145, 17, 19), # iPhone 11
(12981744483764759145, 22, 24),  # Galaxy Note
(12981744483764759145, 30, 32),  # iPhone XS
(12981744483764759145, 33, 35)]  # Google Pixel

Each match is a tuple (match id, start token index, end token index); the end index is exclusive.

match_id, start, end = matches[3]
print(nlp.vocab.strings[match_id], text_doc[start:end])

Output:

match1 Google Pixel
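
For comparison with the PhraseMatcher above, a token-level Matcher describes each token by its attributes instead of listing exact phrases. The following is a minimal sketch, assuming spaCy 3.x (where add takes a key and a list of patterns); the rule name IPHONE_NUMBERED and the sample sentence are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern is a list of per-token constraints: the word "iphone"
# (compared in lowercase) followed by a token made up of digits.
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("IPHONE_NUMBERED", [pattern])  # hypothetical rule name

doc = nlp("The iPhone 11 Pro beat the iPhone XS and an old iphone 8 in tests.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Here "iPhone XS" is skipped because "XS" is not a digit token, which is the kind of rule a plain phrase list cannot express.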

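Exercise: menu satisfaction survey

You are a consultant for the DelFalco Italian restaurant. The owner has asked you to find out whether any foods on their menu disappoint diners.

The owner suggested using reviews from Yelp to determine which dishes people like and dislike. You have pulled the data from Yelp. Before starting the analysis, run the code cell below for a quick look at the data you have to work with.
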
import pandas as pd
data = pd.read_json('../input/nlp-course/restaurant.json')
data.head()

The owner also gave you this list of menu items and common alternate spellings:

menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

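Given the Yelp data and the list of menu items, do you have any ideas for how to find which menu items have disappointed diners?

  • You could group the reviews by the menu items they mention and then compute the average rating for each item. Items that tend to be mentioned in low-scoring reviews stand out, so the restaurant can revise the recipe or remove them from the menu.

1 Find menu items in a review
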
import spacy
from spacy.matcher import PhraseMatcher

index_of_review_to_test_on = 14
text_to_test_on = data.text.iloc[index_of_review_to_test_on]

# Create a blank English pipeline (tokenizer only; no trained model is needed for matching)
nlp = spacy.blank('en')

# Create the tokenized version of text_to_test_on
review_doc = nlp(text_to_test_on)

# Create the PhraseMatcher object. The shared vocabulary (nlp.vocab) is the first argument.
# Use attr='LOWER' so that matching ignores capitalization
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# Create a tokenized Doc for each item in the menu
menu_tokens_list = [nlp(item) for item in menu]

# Add the item patterns to the matcher.
# See https://spacy.io/api/phrasematcher#add in the docs for help with this step
matcher.add("MENU",            # Just a name for the set of rules we're matching to
            menu_tokens_list
           )

# Find matches in the review_doc
matches = matcher(review_doc)

for i, text in enumerate(review_doc):
    print(i, text)
for match in matches:
    print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

  • This locates the positions in the review where menu items are mentioned

0 The
1 Il
2 Purista
3 sandwich
4 has
5 become
6 a
7 staple
8 of
9 my
10 life
11 .
12 Mozzarella
13 ,
14 basil
15 ,
16 prosciutto
17 ,
18 roasted
19 red
20 peppers
21 and
22 balsamic
23 vinaigrette
24 blend
25 into
26 a
27 front
28 runner
29 for
30 the
31 best
32 sandwich
33 in
34 the
35 valley
36 .
37 Goes
38 great
39 with
40 sparkling
41 water
42 or
43 a
44 beer
45 .
46 


47 DeFalco
48 's
49 also
50 has
51 other
52 Italian
53 fare
54 such
55 as
56 a
57 delicious
58 meatball
59 sub
60 and
61 classic
62 pastas
63 .
Token number 2: Purista
Token number 16: prosciutto
Token number 58: meatball

2 Match across all reviews

  • For each menu item (key) found in a review, append that review's star rating to the item's list of scores (value): item -> [stars, ...]

from collections import defaultdict

# item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
# the key is added with an empty list as the value.
item_ratings = defaultdict(list)  # each value in the dictionary is a list

for idx, review in data.iterrows():
    doc = nlp(review.text)
    # Using the matcher from the previous exercise
    matches = matcher(doc)
    
    # Create a set of the items found in the review text
    found_items = set([doc[m[1]:m[2]].lower_ for m in matches])
    
    # Update item_ratings with rating for each item in found_items
    # Transform the item strings to lowercase to make it case insensitive
    for item in found_items:
        item_ratings[item].append(review.stars)

3 The worst-rated dish

# Calculate the mean ratings for each menu item as a dictionary
mean_ratings = {name: sum(scores)/len(scores) for name,scores in item_ratings.items()}

# Find the worst item and store its name as a string in worst_item.
worst_item = min(mean_ratings, key=mean_ratings.get)

# Print the worst item along with its average rating.
print(worst_item)
print(mean_ratings[worst_item])

Output:

chicken cutlet
3.4

4 How often each menu item appears

  • How many reviews mention each dish

counts = {item: len(ratings) for item, ratings in item_ratings.items()}

item_counts = sorted(counts, key=counts.get, reverse=True)
for item in item_counts:
    print(f"{item:>25}{counts[item]:>5}")

Output:

                    pizza  265
                    pasta  206
                 meatball  128
              cheesesteak   97
             cheese steak   76
                  cannoli   72
                  calzone   72
                 eggplant   69
                  purista   63
                  lasagna   59
          italian sausage   53
               prosciutto   50
             chicken parm   50
             garlic bread   39
                  gnocchi   37
                spaghetti   36
                 calzones   35
                   pizzas   32
                   salami   28
            chicken pesto   27
             italian beef   25
                 tiramisu   21
            italian combo   21
                     ziti   21
         chicken parmesan   19
       chicken parmigiana   17
               portobello   14
           mac and cheese   11
           chicken cutlet   10
         steak and cheese    9
                 pastrami    9
               roast beef    7
       fettuccini alfredo    6
           grilled veggie    6
               tuna salad    5
          turkey sandwich    5
          artichoke salad    5
                 macaroni    5
            chicken salad    5
                   reuben    4
    chicken spinach salad    2
              corned beef    2
            turkey breast    1
  • Print the 10 worst and the 10 best items by average rating

sorted_ratings = sorted(mean_ratings, key=mean_ratings.get)

print("Worst rated menu items:")
for item in sorted_ratings[:10]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")
    
print("\n\nBest rated menu items:")
for item in sorted_ratings[-10:]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

Output:

Worst rated menu items:
chicken cutlet       Ave rating: 3.40 	count: 10
turkey sandwich      Ave rating: 3.80 	count: 5
spaghetti            Ave rating: 3.89 	count: 36
italian beef         Ave rating: 3.92 	count: 25
tuna salad           Ave rating: 4.00 	count: 5
macaroni             Ave rating: 4.00 	count: 5
italian combo        Ave rating: 4.05 	count: 21
garlic bread         Ave rating: 4.13 	count: 39
roast beef           Ave rating: 4.14 	count: 7
eggplant             Ave rating: 4.16 	count: 69


Best rated menu items:
chicken pesto        Ave rating: 4.56 	count: 27
chicken salad        Ave rating: 4.60 	count: 5
purista              Ave rating: 4.67 	count: 63
prosciutto           Ave rating: 4.68 	count: 50
reuben               Ave rating: 4.75 	count: 4
steak and cheese     Ave rating: 4.89 	count: 9
artichoke salad      Ave rating: 5.00 	count: 5
fettuccini alfredo   Ave rating: 5.00 	count: 6
turkey breast        Ave rating: 5.00 	count: 1
corned beef          Ave rating: 5.00 	count: 2

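The less data you have for any particular item, the less you can trust that its average rating reflects customers' "true" sentiment.

I would take dishes off the menu that have a low rating and were reviewed by more than 20 people.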
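
The rule of thumb above can be sketched as a simple filter over the worst-rated table. The numbers are copied from the output shown earlier; the 4.0 cut-off for "low rating" is a hypothetical value chosen only for illustration, since the text gives a review-count threshold but no rating threshold:

```python
# (mean rating, review count) copied from the "Worst rated menu items" output above.
worst_rated = {
    "chicken cutlet": (3.40, 10),
    "turkey sandwich": (3.80, 5),
    "spaghetti": (3.89, 36),
    "italian beef": (3.92, 25),
    "tuna salad": (4.00, 5),
    "macaroni": (4.00, 5),
    "italian combo": (4.05, 21),
    "garlic bread": (4.13, 39),
    "roast beef": (4.14, 7),
    "eggplant": (4.16, 69),
}

MIN_REVIEWS = 20  # from the text: only act on items with more than 20 reviews
MAX_RATING = 4.0  # hypothetical cut-off for "low rating"; not from the article

# Keep items that are both rated below the cut-off and backed by enough reviews.
candidates = [
    item
    for item, (rating, count) in worst_rated.items()
    if count > MIN_REVIEWS and rating < MAX_RATING
]
print(candidates)
```

With these assumed thresholds, chicken cutlet has the lowest average but drops out because only 10 reviews mention it, which is the point of requiring a minimum count.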

Originally published: 2020/10/14
