Modal verbs are auxiliary verbs that carry semantic information about an action: possibility (will, should), permission (could, may), obligation (shall/must). An interesting question worth exploring is whether the use of these verbs varies across different kinds of text, and whether that variation means anything.
*Natural Language Processing with Python* (read my review) has an example showing how to start this kind of investigation: it compares modal verb frequencies across genres using the Brown Corpus, a famous text collection assembled in the 1960s for linguistic research.
I extended this example with the contents of roughly 15,000 legal documents, adding an extra genre of court cases and some extra auxiliary verbs.
First, we define a function that yields the document genres, and then one that yields the words for each genre. For the legal documents, I read from n-grams (i.e., word/phrase counts) that I had built previously.
```python
import nltk
from nltk.corpus import brown

def get_genres():
    yield 'legal'
    for genre in brown.categories():
        yield genre

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']

def get_words(genre):
    if genre == 'legal':
        with open('1gram') as grams:
            for line in grams:
                vals = line.split(' ')
                word = vals[0]
                count = int(vals[1])
                if word in modals:
                    # repeat each modal once per occurrence counted in the n-grams
                    for index in range(0, count):
                        yield word
                else:
                    yield word
    else:
        for word in brown.words(categories=genre):
            yield word
```
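The `1gram` file read above holds one `word count` pair per line. For context, here is a minimal sketch of how such a file could be produced with the standard library; the naive whitespace tokenization is a stand-in for illustration, not the process actually used on the legal corpus:

```python
from collections import Counter

def write_1gram(texts, path):
    """Count word occurrences across documents and write 'word count' lines."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    with open(path, 'w') as out:
        for word, count in counts.most_common():
            out.write('%s %d\n' % (word, count))

# toy documents, standing in for the real legal filings
write_1gram(['The court may grant relief', 'The motion must be denied'],
            '1gram')
```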
The Natural Language Toolkit provides a class for tracking the frequencies of "experiment" outcomes; here we use it to track how often the different modal verbs occur.
```python
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in get_genres()
    for word in get_words(genre))

genres = [g for g in get_genres()]
cfd.tabulate(conditions=genres, samples=modals)
```
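Under the hood, a `ConditionalFreqDist` is essentially a mapping from each condition to a frequency counter. The same bookkeeping can be sketched with the standard library (toy pairs here, standing in for the generator output above):

```python
from collections import Counter, defaultdict

# toy (genre, word) pairs, standing in for the generators above
pairs = [('legal', 'may'), ('legal', 'may'), ('legal', 'must'),
         ('news', 'will')]

cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

print(cfd['legal']['may'])   # 2
print(cfd['news']['will'])   # 1
```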
The tabulate method is provided by NLTK and produces a nicely formatted table (on the command line, everything lines up neatly).
  | can | could | may | might | must | will | would | should |
---|---|---|---|---|---|---|---|---|
legal | 13059 | 7849 | 26968 | 1762 | 15974 | 20757 | 19931 | 13916 |
adventure | 46 | 151 | 5 | 58 | 27 | 50 | 191 | 15 |
belles_lettres | 246 | 213 | 207 | 113 | 170 | 236 | 392 | 102 |
editorial | 121 | 56 | 74 | 39 | 53 | 233 | 180 | 88 |
fiction | 37 | 166 | 8 | 44 | 55 | 52 | 287 | 35 |
government | 117 | 38 | 153 | 13 | 102 | 244 | 120 | 112 |
hobbies | 268 | 58 | 131 | 22 | 83 | 264 | 78 | 73 |
humor | 16 | 30 | 8 | 8 | 9 | 13 | 56 | 7 |
learned | 365 | 159 | 324 | 128 | 202 | 340 | 319 | 171 |
lore | 170 | 141 | 165 | 49 | 96 | 175 | 186 | 76 |
mystery | 42 | 141 | 13 | 57 | 30 | 20 | 186 | 29 |
news | 93 | 86 | 66 | 38 | 50 | 389 | 244 | 59 |
religion | 82 | 59 | 78 | 12 | 54 | 71 | 68 | 45 |
reviews | 45 | 40 | 45 | 26 | 19 | 58 | 47 | 18 |
romance | 74 | 193 | 11 | 51 | 45 | 43 | 244 | 32 |
science_fiction | 16 | 49 | 4 | 12 | 8 | 16 | 79 | 3 |
Looking at these numbers, it is clear we need to add some notion of normalization. The corpus I added has far more tokens than the Brown Corpus, which makes the two hard to compare.
The frequency distribution class is built for counting things, and I could not find a good way to normalize its rows. So I rewrote the tabulate function to total each row and express every count as a percentage of that row's total.
```python
import sys

def tabulate(cfd, conditions, samples):
    max_len = max(len(w) for w in conditions)
    sys.stdout.write(" " * (max_len + 1))
    for c in samples:
        sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
    for c in conditions:
        sys.stdout.write(" " * (max_len - len(c)))
        sys.stdout.write("%-s " % c)
        dist = cfd[c]
        # normalize each row by its total modal count
        norm = sum(dist[w] for w in modals)
        for s in samples:
            value = 100 * dist[s] // norm
            sys.stdout.write("%-d\t" % value)
        sys.stdout.write("\n")

tabulate(cfd, genres, modals)
```
This makes the table much easier to interpret.
  | can | could | may | might | must | will | would | should |
---|---|---|---|---|---|---|---|---|
legal | 10 | 6 | 22 | 1 | 13 | 17 | 16 | 11 |
adventure | 8 | 27 | 0 | 10 | 4 | 9 | 35 | 2 |
belles_lettres | 14 | 12 | 12 | 6 | 10 | 14 | 23 | 6 |
editorial | 14 | 6 | 8 | 4 | 6 | 27 | 21 | 10 |
fiction | 5 | 24 | 1 | 6 | 8 | 7 | 41 | 5 |
government | 13 | 4 | 17 | 1 | 11 | 27 | 13 | 12 |
hobbies | 27 | 5 | 13 | 2 | 8 | 27 | 7 | 7 |
humor | 10 | 20 | 5 | 5 | 6 | 8 | 38 | 4 |
learned | 18 | 7 | 16 | 6 | 10 | 16 | 15 | 8 |
lore | 16 | 13 | 15 | 4 | 9 | 16 | 17 | 7 |
mystery | 8 | 27 | 2 | 11 | 5 | 3 | 35 | 5 |
news | 9 | 8 | 6 | 3 | 4 | 37 | 23 | 5 |
religion | 17 | 12 | 16 | 2 | 11 | 15 | 14 | 9 |
reviews | 15 | 13 | 15 | 8 | 6 | 19 | 15 | 6 |
romance | 10 | 27 | 1 | 7 | 6 | 6 | 35 | 4 |
science_fiction | 8 | 26 | 2 | 6 | 4 | 8 | 42 | 1 |
One thing that stands out is that most genres use "would" heavily and "should" rarely.
These numbers read even better on a 1-to-10 scale, where the size of a value in a column conveys something at a glance.
```python
def tabulate(cfd, conditions, samples):
    max_len = max(len(w) for w in conditions)
    sys.stdout.write(" " * (max_len + 1))
    for c in samples:
        sys.stdout.write("%-s\t" % c)
    sys.stdout.write("\n")
    for c in conditions:
        sys.stdout.write(" " * (max_len - len(c)))
        sys.stdout.write("%-s " % c)
        dist = cfd[c]
        norm = sum(dist[w] for w in modals)
        for s in samples:
            # scale to a 0-10 range with one decimal place
            value = 10 * float(dist[s]) / norm
            sys.stdout.write("%.1f\t" % value)
        sys.stdout.write("\n")

tabulate(cfd, genres, modals)
```
  | can | could | may | might | must | will | would | should |
---|---|---|---|---|---|---|---|---|
legal | 1.1 | 0.7 | 2.2 | 0.1 | 1.3 | 1.7 | 1.7 | 1.2 |
adventure | 0.8 | 2.8 | 0.1 | 1.1 | 0.5 | 0.9 | 3.5 | 0.3 |
belles_lettres | 1.5 | 1.3 | 1.2 | 0.7 | 1.0 | 1.4 | 2.3 | 0.6 |
editorial | 1.4 | 0.7 | 0.9 | 0.5 | 0.6 | 2.8 | 2.1 | 1.0 |
fiction | 0.5 | 2.4 | 0.1 | 0.6 | 0.8 | 0.8 | 4.2 | 0.5 |
government | 1.3 | 0.4 | 1.7 | 0.1 | 1.1 | 2.7 | 1.3 | 1.2 |
hobbies | 2.7 | 0.6 | 1.3 | 0.2 | 0.8 | 2.7 | 0.8 | 0.7 |
humor | 1.1 | 2.0 | 0.5 | 0.5 | 0.6 | 0.9 | 3.8 | 0.5 |
learned | 1.8 | 0.8 | 1.6 | 0.6 | 1.0 | 1.7 | 1.6 | 0.9 |
lore | 1.6 | 1.3 | 1.6 | 0.5 | 0.9 | 1.7 | 1.8 | 0.7 |
mystery | 0.8 | 2.7 | 0.3 | 1.1 | 0.6 | 0.4 | 3.6 | 0.6 |
news | 0.9 | 0.8 | 0.6 | 0.4 | 0.5 | 3.8 | 2.4 | 0.6 |
religion | 1.7 | 1.3 | 1.7 | 0.3 | 1.2 | 1.5 | 1.4 | 1.0 |
reviews | 1.5 | 1.3 | 1.5 | 0.9 | 0.6 | 1.9 | 1.6 | 0.6 |
romance | 1.1 | 2.8 | 0.2 | 0.7 | 0.6 | 0.6 | 3.5 | 0.5 |
science_fiction | 0.9 | 2.6 | 0.2 | 0.6 | 0.4 | 0.9 | 4.2 | 0.2 |
It is nice to see how similar the genres are, and we can describe that by imagining the modal counts as vectors and treating the angle between vectors as "similarity". The advantage is that this ignores all other words (words that may occur in only one text, some of which reflect how well the data was cleaned rather than the genre).
```python
import math

def distance(cfd, conditions, samples, base):
    base_cond = cfd[base]
    base_vector = [base_cond[w] for w in samples]
    base_length = math.sqrt(sum(a * a for a in base_vector))
    for c in conditions:
        cond = cfd[c]
        cond_vector = [cond[w] for w in samples]
        dotp = sum(a * b for (a, b) in zip(base_vector, cond_vector))
        cond_length = math.sqrt(sum(a * a for a in cond_vector))
        # map the angle between the vectors onto 0-100 (0 = orthogonal)
        angle = math.acos(dotp / (cond_length * base_length))
        percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
        print("%-s similarity to %-s: %-.1f" % (c, base, percent))

distance(cfd, genres, modals, 'legal')
```
The results are interesting. In this case, the genres closest to "legal" are "religion" and "government" ("learned" ties with "government").
One amusing find: belles_lettres means "fine writing", i.e. poetry, drama, and fiction.
```
legal similarity to legal: 100.0
adventure similarity to legal: 41.6
belles_lettres similarity to legal: 72.4
editorial similarity to legal: 68.8
fiction similarity to legal: 42.9
government similarity to legal: 80.6
hobbies similarity to legal: 63.5
humor similarity to legal: 50.1
learned similarity to legal: 80.6
lore similarity to legal: 78.6
mystery similarity to legal: 41.3
news similarity to legal: 58.1
religion similarity to legal: 81.2
reviews similarity to legal: 73.5
romance similarity to legal: 42.9
science_fiction similarity to legal: 41.8
```
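As a sanity check on the angle-to-percent mapping used above: vectors pointing the same direction should score 100, and orthogonal vectors should score 0:

```python
import math

def similarity(u, v):
    """Angle between two vectors, rescaled so 0 degrees -> 100 and 90 -> 0."""
    dotp = sum(a * b for a, b in zip(u, v))
    lu = math.sqrt(sum(a * a for a in u))
    lv = math.sqrt(sum(b * b for b in v))
    angle = math.acos(dotp / (lu * lv))
    return (math.pi / 2 - angle) / (math.pi / 2) * 100

print(similarity([3, 4], [3, 4]))  # 100.0 (same direction)
print(similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
```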
Some genres look similar to legal documents, but it is possible that some of the verbs are not independent; for example, "may" and "might" might carry similar signals. One way to test this is to flip the distance we track: create a vector for each modal instead of for each genre.
The code below tracks the distance between each modal and the mean, using the genres as dimensions. Since each modal contributes to the mean, they will all show some similarity to it, but note that some are more similar than others. Also note that the counts must be normalized, as in the last example, or the answer would be dominated by the "legal" genre.
```python
def distance(cfd, conditions, samples):
    base_vector = [0.0 for w in conditions]
    norm = {}
    for c_i in range(0, len(conditions)):
        cond_name = conditions[c_i]
        cond = cfd[cond_name]
        norm[cond_name] = float(sum(cond[s] for s in samples))
        for s in samples:
            base_vector[c_i] = base_vector[c_i] + float(cond[s]) / norm[cond_name]
    base_length = math.sqrt(sum(a * a for a in base_vector))
    for s in samples:
        # compute each modal's vector - can, might, etc.
        sample_vector = []
        for c in conditions:
            # one dimension per genre, normalized by that genre's total
            sample_vector.append(cfd[c][s] / norm[c])
        dotp = sum(a * b for (a, b) in zip(base_vector, sample_vector))
        sample_length = math.sqrt(sum(a * a for a in sample_vector))
        angle = math.acos(dotp / (sample_length * base_length))
        percent = (math.pi / 2 - angle) / (math.pi / 2) * 100
        print("%-s similarity to mean: %-.1f" % (s, percent))

distance(cfd, genres, modals)
```
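To see why the per-genre normalization matters, here is a toy example with two hypothetical genres (illustrative counts, not from the corpus). Each modal's vector is scaled by its genre's total before angles are compared, so a large corpus cannot dominate every dimension:

```python
counts = {'g1': {'may': 8, 'must': 2},   # hypothetical large genre
          'g2': {'may': 1, 'must': 1}}   # hypothetical small genre

genres = ['g1', 'g2']
norm = {g: float(sum(c.values())) for g, c in counts.items()}

# one vector per modal, one dimension per genre
may_vec = [counts[g]['may'] / norm[g] for g in genres]
must_vec = [counts[g]['must'] / norm[g] for g in genres]
print(may_vec)   # [0.8, 0.5]
print(must_vec)  # [0.2, 0.5]
```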
What I take from this is that the verb that helps least in distinguishing genres is "must", and the most useful is "may".
```
can similarity to mean: 76.0
could similarity to mean: 67.6
may similarity to mean: 61.5
might similarity to mean: 70.0
must similarity to mean: 79.7
will similarity to mean: 67.7
would similarity to mean: 73.6
should similarity to mean: 74.2
```