Getting Started with NLTK, the Most Powerful NLP Toolkit

Author: 皮大大

Published 2025-05-09 13:37:52

WeChat official account: 尤而小屋 | Editor: Peter | Author: Peter

Hi everyone, I'm Peter.

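In today's age of information overload, natural language processing (NLP) has become one of the major research areas in artificial intelligence. Machine translation, sentiment analysis, text classification, and speech recognition all depend on NLP techniques.

Exploring and practicing NLP requires a capable, easy-to-use library. The Natural Language Toolkit (NLTK) is an open-source Python library built for researchers, developers, and students. It provides tokenization, part-of-speech tagging, syntactic parsing, semantic reasoning, and more, and it gives access to a large collection of corpora and pre-trained models.

NLTK is widely used for text processing, linguistic research, and teaching.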

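Main features

Text processing:
- Tokenization: split text into words or sentences.
- Part-of-speech tagging: label each word with its part of speech.
- Stemming: reduce a word to its stem.
- Lemmatization: reduce a word to its dictionary form (lemma).
Text analysis:
- Named entity recognition: identify people, places, and other entities in text.
- Syntactic parsing: analyze sentence structure.
- Semantic analysis: interpret the meaning of text.
Corpus management:
- Access to many corpora and lexicons, such as WordNet and the Brown Corpus.
Machine learning integration:
- Support for tasks such as text classification and sentiment analysis.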

Installation

Install NLTK with the following command:

pip install nltk

Official API reference: https://www.nltk.org/api/nltk.html

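Downloading NLTK corpora

NLTK provides many ready-made datasets (corpora) for research, teaching, and development in natural language processing. They are not bundled with the pip package and have to be downloaded separately. Some commonly used ones and their typical uses:

1. Common corpora

Gutenberg corpus: classic literary works, such as the Bible and works by Shakespeare. Uses: text analysis, literary studies.
Brown corpus: texts in many genres and topics, such as news, fiction, and science. Uses: stylistic analysis, text classification.
Reuters corpus: Reuters news articles. Uses: text classification, information extraction.
Inaugural corpus: US presidential inaugural addresses. Uses: historical text analysis, studies of language change.
Web Text corpus: text collected from web forums and chat rooms. Uses: analysis of informal language.

2. Dictionaries and lexical resources

WordNet:
- An English lexical database with synonyms, antonyms, word definitions, and more.
- Uses: semantic analysis, word sense disambiguation.
Stopwords:
- Lists of common stop words (such as "the" and "is").
- Uses: text preprocessing.
Names corpus:
- Common personal names.
- Uses: named entity recognition.

3. Other corpora

Movie Reviews corpus:
- Movie reviews labeled as positive or negative.
- Uses: sentiment analysis.
Treebank corpus:
- Sentences with syntactic annotations.
- Uses: syntactic parsing.
Conll2000 corpus:
- Text with chunk annotations.
- Uses: chunking tasks.

4. Other resources

CMU Pronunciation Dictionary:
- Words and their pronunciations.
- Uses: speech processing, pronunciation analysis.
State of the Union corpus:
- US presidents' State of the Union addresses.
- Uses: political text analysis.
Twitter Samples corpus:
- A sample of tweets collected from Twitter.
- Uses: social media text analysis.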

import nltk
nltk.download()

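Called with no arguments, nltk.download() opens an interactive downloader (a graphical window on Windows). Select the packages you need, click the download button, and wait for the installation to finish.

# Download specific corpora
nltk.download('gutenberg')  # Gutenberg corpus
nltk.download('wordnet')     # WordNet
nltk.download('stopwords')   # stop word lists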
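Re-running nltk.download() on every start is harmless but noisy. The sketch below checks whether a resource is already installed before downloading it. The helper name ensure_resource and its download parameter are assumptions made for illustration and are not part of NLTK; only nltk.data.find and nltk.download are real NLTK calls.

```python
import nltk


def ensure_resource(resource_path, package, download=nltk.download):
    """Download `package` only if `resource_path` is not installed yet.

    Hypothetical helper, not part of NLTK.
    Returns True if a download was triggered, False otherwise.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        download(package)
        return True
```

A call such as ensure_resource("corpora/stopwords", "stopwords") would then download the stop word lists only on the first run. The download parameter exists so the helper can be exercised without network access.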

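Application 1: Tokenizing

Text can be tokenized by word or by sentence. Each serves a different purpose:

Word tokenization:

Splits text into individual words and symbols.
Suited to word frequency counts, POS tagging, stemming, and similar tasks.

Sentence tokenization:

Splits text into complete sentences.
Suited to sentence-level analysis such as syntactic parsing, semantic analysis, or machine translation.

Word tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize  # word and sentence tokenizers
nltk.download('punkt')  # tokenizer data, first run only; NLTK 3.9 and later look for 'punkt_tab'

Given a piece of text: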

example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult.
"""

word_tokenize(example_string)

["Muad'Dib",
'learned',
'rapidly',
'because',
'his',
'first',
'training',
'was',
'in',
'how',
'to',
'learn',
'.',
'And',
'the',
'first',
'lesson',
'of',
'all',
'was',
'the',
'basic',
'trust',
'that',
'he',
'could',
'learn',
'.',
'It',
"'s",
'shocking',
'to',
'find',
'how',
'many',
'people',
'do',
'not',
'believe',
'they',
'can',
'learn',
',',
'and',
'how',
'many',
'more',
'believe',
'learning',
'to',
'be',
'difficult',
'.']

As the result shows, word tokenization returns a list of words and punctuation marks.

Sentence tokenization
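word_tokenize depends on downloaded tokenizer data. As a rough sketch for situations where that data is not available, the regex-based tokenizers in nltk.tokenize need no download. They split contractions differently from the output above, so they are an alternative with different behavior, not a drop-in replacement.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "It's shocking to find how many people do not believe they can learn."

# Split into runs of word characters and runs of punctuation.
punct_tokens = wordpunct_tokenize(sentence)

# Keep only runs of word characters, dropping punctuation entirely.
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

print(punct_tokens[:4])      # ['It', "'", 's', 'shocking']
print(word_only_tokens[:3])  # ['It', 's', 'shocking']
```

Note that word_tokenize keeps the contraction as "'s", while these tokenizers break it at the apostrophe.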

sent_tokenize(example_string)

["\nMuad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]

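Sentence tokenization returns a list of three sentences.

Application 2: Filtering Stop Words

In NLP, stop words are common words that are ignored during text analysis. They contribute little to the meaning of a text but occur very frequently. Filtering them out reduces noise and makes text processing and analysis more efficient.

Common stop words

Chinese: 的, 了, 是, 在, 和, 就, 我, 你, 他, 这, 那

English: the, a, an, in, on, is, are, and, of, for

Example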

nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "Sir, I protest. I am not a merry man!"

# Tokenize the sentence
words_in_sentence = word_tokenize(sentence)
words_in_sentence

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

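Load the English stop words:

stop_words = set(stopwords.words("english"))  # {'a', 'about', 'above','after','again'...}

filtered_list = []

for word in words_in_sentence:
    if word.casefold() not in stop_words:  # keep the word only if it is not a stop word
        filtered_list.append(word)

filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

I, am, not, and a have been removed. Note that dropping "not" reverses the meaning of the sentence, so stop word filtering should be applied with care in tasks where negation matters.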
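The loop above can also be written as a list comprehension, and punctuation can be dropped in the same pass. The sketch below uses a tiny hand-written stop word set (DEMO_STOP_WORDS, an assumption for illustration) so that it runs without any download; in practice the set would come from stopwords.words("english").

```python
# A tiny hand-written stop word set, used here only for illustration.
# The real list from stopwords.words("english") is much longer.
DEMO_STOP_WORDS = {"i", "am", "not", "a", "the", "is"}

tokens = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

# Keep alphabetic tokens that are not stop words (case-insensitive).
content_words = [
    token for token in tokens
    if token.isalpha() and token.casefold() not in DEMO_STOP_WORDS
]

print(content_words)  # ['Sir', 'protest', 'merry', 'man']
```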

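Application 3: Stemming

Stemming is a text processing task that reduces a word to its stem. The stem is the core part of a word, usually without affixes such as prefixes and suffixes. For example, "helping" and "helper" share the stem "help".

Stemming lets you focus on the basic meaning of a word rather than the particular form in which it appears.

What stemming is used for

Normalizing words: mapping different forms of a word to the same stem reduces vocabulary variety.
Making text analysis cheaper: a smaller vocabulary lowers the amount of computation.
Information retrieval and text classification: in search engines and classifiers, stemming can improve the quality of the results.

Example

NLTK offers several stemmers. The most commonly used are the Porter stemmer and the Snowball stemmer (whose English variant is also known as Porter2). The Porter stemmer handles English only, while Snowball also supports a number of other languages. Both are rule-based, so the result is not always the linguistic root.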

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer  # the two stemmer classes
from nltk.tokenize import word_tokenize  # word tokenization

# Download the required data (first run only)
nltk.download('punkt')  # NLTK 3.9 and later look for 'punkt_tab'

# Sample text
text = "Helping helpers help others by providing helpful help."

# Tokenize into words
words = word_tokenize(text)

# Initialize the Porter stemmer
stemmer1 = PorterStemmer()
# Initialize the Snowball stemmer (English in this case)
stemmer2 = SnowballStemmer('english')

# Stem every token
stemmed_words1 = [stemmer1.stem(word) for word in words]
stemmed_words2 = [stemmer2.stem(word) for word in words]

# Print the results
print("Tokens:", words)
print("Porter stems:", stemmed_words1)
print("Snowball stems:", stemmed_words2)

Tokens: ['Helping', 'helpers', 'help', 'others', 'by', 'providing', 'helpful', 'help', '.']
Porter stems: ['help', 'helper', 'help', 'other', 'by', 'provid', 'help', 'help', '.']
Snowball stems: ['help', 'helper', 'help', 'other', 'by', 'provid', 'help', 'help', '.']
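To make the normalization effect concrete, the following sketch groups surface forms by their Porter stem. The word list is hand-picked for illustration; PorterStemmer needs no downloaded data, so the snippet runs on a bare NLTK install.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Helping", "help", "helpful", "cats", "cat", "running", "run"]

# Map each stem to the surface forms that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, forms in groups.items():
    print(stem, "<-", forms)
```

Seven distinct tokens collapse into three stems, which is the vocabulary reduction described above.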

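Application 4: Part-of-Speech (POS) Tagging

Part of speech is a grammatical term for the role a word plays in a sentence. POS tagging is the task of labeling each word in a text with its part of speech.

The eight main parts of speech in English

Noun
Pronoun
Verb
Adjective
Adverb
Preposition
Conjunction
Interjection

What POS tagging is used for

Syntactic parsing: helps reveal sentence structure and grammatical rules.
Semantic analysis: helps determine what a word means in its sentence.
Information extraction: picks out nouns, verbs, and other key information.
Machine translation: improves translation accuracy, among other benefits.

Example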

import nltk
from nltk.tokenize import word_tokenize

# Download the required data (first run only)
# NLTK 3.9 and later look for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

First, tokenize the text:

# Sample text
text = "I love learning natural language processing with NLTK."
# Tokenize into words
words = word_tokenize(text)

Then tag each token with its part of speech:

# Tag the tokens
pos_tags = nltk.pos_tag(words)

# Print the result
print("POS tags:", pos_tags)

POS tags: [('I', 'PRP'), ('love', 'VBP'), ('learning', 'VBG'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('with', 'IN'), ('NLTK', 'NNP'), ('.', '.')]
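Once tokens are tagged, extracting key information is a matter of filtering on the tag. The sketch below reuses the tagged output above as a literal list, so it needs neither NLTK nor any downloaded model; it relies only on the fact that Penn Treebank noun tags start with "NN" and verb tags with "VB".

```python
from collections import Counter

# Tagged tokens copied from the output above.
pos_tags = [('I', 'PRP'), ('love', 'VBP'), ('learning', 'VBG'),
            ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
            ('with', 'IN'), ('NLTK', 'NNP'), ('.', '.')]

# Select words by tag prefix and count how often each tag occurs.
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]
tag_counts = Counter(tag for _, tag in pos_tags)

print(nouns)             # ['language', 'processing', 'NLTK']
print(verbs)             # ['love', 'learning']
print(tag_counts["NN"])  # 2
```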

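NLTK POS tags

NLTK provides the nltk.help.upenn_tagset() function for looking up the tags and their meanings.

import nltk

# Download the required data (first run only)
nltk.download('tagsets')
# Look up specific tags
nltk.help.upenn_tagset('NN')  # meaning of the NN (noun) tag
nltk.help.upenn_tagset('VB')  # meaning of the VB (verb) tag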

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...

# List every tag and its meaning
nltk.help.upenn_tagset()

The word-level tags and their meanings:

Tag	Meaning	Examples
CC	Coordinating conjunction	`and`, `but`
CD	Cardinal number	`1`, `two`
DT	Determiner	`the`, `a`
EX	Existential there	`there`
FW	Foreign word	`déjà vu`
IN	Preposition or subordinating conjunction	`in`, `because`
JJ	Adjective	`happy`, `big`
JJR	Adjective, comparative	`happier`, `bigger`
JJS	Adjective, superlative	`happiest`, `biggest`
LS	List item marker	`1)`, `A.`
MD	Modal	`can`, `will`
NN	Noun, singular or mass	`dog`, `happiness`
NNS	Noun, plural	`dogs`, `cities`
NNP	Proper noun, singular	`John`, `London`
NNPS	Proper noun, plural	`Americans`, `Germans`
PDT	Predeterminer	`all`, `both`
POS	Possessive ending	`'s`
PRP	Personal pronoun	`I`, `he`
PRP$	Possessive pronoun	`my`, `his`
RB	Adverb	`quickly`, `very`
RBR	Adverb, comparative	`faster`, `better`
RBS	Adverb, superlative	`fastest`, `best`
RP	Particle	`up`, `off`
SYM	Symbol	`+`, `%`
TO	The word `to` (preposition or infinitive marker)	`to`
UH	Interjection	`oh`, `wow`
VB	Verb, base form	`run`, `eat`
VBD	Verb, past tense	`ran`, `ate`
VBG	Verb, gerund or present participle	`running`, `eating`
VBN	Verb, past participle	`eaten`, `written`
VBP	Verb, non-3rd person singular present	`run`, `eat`
VBZ	Verb, 3rd person singular present	`runs`, `eats`
WDT	Wh-determiner	`which`, `what`
WP	Wh-pronoun	`who`, `what`
WP$	Possessive wh-pronoun	`whose`
WRB	Wh-adverb	`where`, `when`

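Application 5: Lemmatizing

What is lemmatization

Lemmatization reduces a word to its lemma. A lemma is the base (dictionary) form of a word and is normally a complete, meaningful word. For example:

running → run
better → good
geese → goose

Compared with stemming:

Stemming reduces a word to its stem. The stem is the core part of the word and may not be a complete word. For example:

running → run
better → better
discovery → discoveri

Lemmatization vs. stemming

The main differences:

Feature	Lemmatization	Stemming
Output	A complete word (lemma)	Possibly a word fragment
Meaning	Preserved	May be lost
POS dependence	Needs the part of speech	None
Complexity	Higher	Lower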

Example 1

import nltk
from nltk.stem import WordNetLemmatizer  # WordNet-based lemmatizer

# Download the required data (first run only)
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Sample words
words = ["running", "better", "cats", "geese"]

# Lemmatize each word; lemmatize() treats words as nouns (pos='n') by default
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print the results
print("Original words:", words)
print("Lemmas:", lemmatized_words)

Original words: ['running', 'better', 'cats', 'geese']
Lemmas: ['running', 'better', 'cat', 'goose']

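Example 2

Passing the part of speech can improve the result. The code below treats every word as a verb:

# Lemmatize each word with an explicit part of speech
lemmatized_words = [
    lemmatizer.lemmatize(word, pos='v')  # treat the word as a verb
    for word in words
]

# Print the results
print("Lemmas with pos='v':", lemmatized_words)

Lemmas with pos='v': ['run', 'better', 'cat', 'geese']

Here running becomes run, but geese is no longer reduced to goose because it is not a verb, so the part of speech should match each individual word.

Another example:

print("Without POS:", lemmatizer.lemmatize("worst"))  # defaults to noun
print("With POS:", lemmatizer.lemmatize("worst", pos="a"))  # treat as adjective

Without POS: worst
With POS: bad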
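Since the right part of speech differs from word to word, one common approach is to derive it from the Penn Treebank tag that pos_tag returns. The helper below is a minimal sketch of that mapping; penn_to_wordnet is a hypothetical name, the tagged list is hand-written for illustration, and the single-letter codes are the same ones passed as pos in the calls above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default


# Hand-written (word, tag) pairs, standing in for pos_tag output.
tagged = [("geese", "NNS"), ("running", "VBG"), ("worst", "JJS"), ("quickly", "RB")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]

print(wordnet_pos)
# [('geese', 'n'), ('running', 'v'), ('worst', 'a'), ('quickly', 'r')]
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).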

Example 3

Comparing lemmatization and stemming

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize  # word tokenization

text = "The cats are running in the field and the geese are flying."

word_token = word_tokenize(text)

1. Stemming:

# Stemming with the Porter stemmer
stemmer1 = PorterStemmer()
# Stem every token
stemmed_words = [stemmer1.stem(word) for word in word_token]

2. Lemmatization:

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize every token (as a noun, the default)
lemmatized_words = [lemmatizer.lemmatize(word) for word in word_token]

Compare the two results:

print("Stemming: ", stemmed_words)
print("Lemmatization: ", lemmatized_words)

Stemming:  ['the', 'cat', 'are', 'run', 'in', 'the', 'field', 'and', 'the', 'gees', 'are', 'fli', '.']
Lemmatization:  ['The', 'cat', 'are', 'running', 'in', 'the', 'field', 'and', 'the', 'goose', 'are', 'flying', '.']

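Application 6: Chunking

Chunking is the task of dividing text into phrases.

It differs from tokenizing: tokenizing splits text into words or sentences, while chunking groups words into meaningful phrases.

What chunking is used for

Phrase extraction: pulling noun phrases, verb phrases, and other phrases out of text.
Syntactic parsing: helps reveal the grammatical structure of a sentence.
Information extraction: picking out key information such as person names, place names, and dates.

Common chunk types

Noun phrase (NP): for example, the cat, a big house.
Verb phrase (VP): for example, is running, has been completed.
Prepositional phrase (PP): for example, in the house, with a smile.

Example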

import nltk
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

sentence_chunk = "It's a dangerous business, Frodo, going out your door."
# Tokenize
word_sentence_chunk = word_tokenize(sentence_chunk)

Tag the tokens with their parts of speech:

nltk.download("averaged_perceptron_tagger")  # NLTK 3.9 and later look for 'averaged_perceptron_tagger_eng'
# POS tagging
chunk_pos = pos_tag(word_sentence_chunk)
chunk_pos

[('It', 'PRP'),
("'s", 'VBZ'),
('a', 'DT'),
('dangerous', 'JJ'),
('business', 'NN'),
(',', ','),
('Frodo', 'NNP'),
(',', ','),
('going', 'VBG'),
('out', 'RP'),
('your', 'PRP$'),
('door', 'NN'),
('.', '.')]

Define the chunking rule with a regular expression over tags:

# Noun phrase (NP) rule: an optional determiner, any number of adjectives, then a noun
chunk_rule = """
NP: {<DT>?<JJ>*<NN>}
"""

# Create the chunk parser
chunk_parser = RegexpParser(chunk_rule)

# Chunk the tagged tokens
chunked_text = chunk_parser.parse(chunk_pos)

# Print the results
print("POS tags:", chunk_pos)
print("Chunking result:")
chunked_text.pretty_print()  # draws the chunk tree as text

POS tags: [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]

Show the result as a tree in a graphical window:

chunked_text.draw()  # opens a window that displays the tree
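For further processing it is often more convenient to get the chunks as plain strings than as a drawing. The sketch below applies the same NP rule to the tagged sentence above, written out as a literal list so that no tagger model is needed, and collects the text of every NP subtree.

```python
from nltk.chunk import RegexpParser

# Tagged tokens copied from the POS tagging output above.
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
          ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
          ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
          (".", ".")]

tree = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)

# Join the words of every subtree labeled NP.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(noun_phrases)  # ['a dangerous business', 'door']
```

Frodo is tagged NNP, which the pattern <NN> does not match, so it is not chunked.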

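Chinking

Definition

In NLP, chinking is used together with chunking to remove unwanted parts from chunks. It refines the chunking result so that it better fits the task.

Chunking (include): defines the patterns to keep.
Chinking (exclude): defines the patterns to remove from the chunks.

Example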

import nltk
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

sentence_chunk = "It's a dangerous business, Frodo, going out your door."
# Tokenize
word_sentence_chunk = word_tokenize(sentence_chunk)
nltk.download("averaged_perceptron_tagger")  # NLTK 3.9 and later look for 'averaged_perceptron_tagger_eng'
# POS tagging
chunk_pos = pos_tag(word_sentence_chunk)
chunk_pos

[('It', 'PRP'),
("'s", 'VBZ'),
('a', 'DT'),
('dangerous', 'JJ'),
('business', 'NN'),
(',', ','),
('Frodo', 'NNP'),
(',', ','),
('going', 'VBG'),
('out', 'RP'),
('your', 'PRP$'),
('door', 'NN'),
('.', '.')]

Combining chunking and chinking:

# Define the rules
chunk_rule = """
Chunk: {<.*>+}  # chunk every token
       }<JJ>{   # chink (exclude) adjectives
"""

# Create the chunk parser
chunk_parser = RegexpParser(chunk_rule)

# Chunk the tagged tokens
chunked_text = chunk_parser.parse(chunk_pos)
chunked_text
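To see what the chink rule does, the sketch below runs the same grammar on the tagged sentence, written out as a literal list so that no tagger model is needed, and lists the words in each resulting chunk. The adjective is cut out, which splits the single all-inclusive chunk in two.

```python
from nltk.chunk import RegexpParser

# Tagged tokens copied from the POS tagging output above.
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
          ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
          ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
          (".", ".")]

grammar = r"""
Chunk: {<.*>+}
       }<JJ>{
"""
tree = RegexpParser(grammar).parse(tagged)

# Collect the words of every subtree labeled Chunk.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]

print(chunks)
# [['It', "'s", 'a'], ['business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']]
```

The word dangerous remains in the tree as a bare token between the two chunks.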

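Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For copyright concerns, contact cloudcommunity@tencent.com to request removal.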
