## Python n-gram frequency count

Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.


``````
text_column
This is a book
This is a book that is read
This is a book but he doesn't think this is a book
``````

``````
2 gram         Count
This is          3
a book           3
``````

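"This is" and "a book" appear in all 3 texts. Each occurs twice in the third text, but I am only interested in how many documents each 2-gram appears in, so the count is 3 rather than 4.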

### 2 Answers

A Pythonic answer (written generically so it can be applied to a file, a dataframe, or anything else):

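``````
import collections

c = collections.Counter()
for i in fh:  # fh: any iterable of text lines, e.g. an open file
    x = i.rstrip().split(" ")
    # set() keeps each 2-gram once per line, so lines are counted, not occurrences
    c.update(set(zip(x[:-1], x[1:])))
``````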

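1. Each line is `split` on spaces into a list.
2. `zip()` then returns an iterator of tuples of length 2 (the 2-grams).
3. The iterator is passed to `set()` to remove duplicates within the line.
4. The set is then fed into a `collections.Counter()` object, which tracks how many times each tuple has been seen. You need `import collections` to use it.
5. From there it is easy to list the counter's contents or convert them to any other format you like (for example, a dataframe).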
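The steps above can be sketched end to end on the three sample texts from the question. This is only an illustration: the `lines` list stands in for the `fh` file handle, and the helper name `count_bigram_documents` is hypothetical, not part of the original answer.

```python
import collections

# Sample texts from the question; in practice this could be an open file
# handle or a DataFrame column (assumption for illustration).
lines = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]


def count_bigram_documents(docs):
    """Count, for each 2-gram, how many documents it appears in."""
    counts = collections.Counter()
    for doc in docs:
        words = doc.rstrip().split(" ")
        # set() keeps each 2-gram once per document, so a 2-gram repeated
        # inside one document still adds only 1 to its count.
        counts.update(set(zip(words[:-1], words[1:])))
    return counts


counts = count_bigram_documents(lines)

# List the most frequent 2-grams; the order among ties is not guaranteed.
for (first, second), n in counts.most_common(3):
    print(f"{first} {second}\t{n}")
```

Note that the split is case-sensitive, so the lowercase "this is" in the third text is counted as a separate 2-gram from "This is".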

``````
import pandas as pd

bigram_freq = {}
for doc in df["text_column"]:
    cur_bigrams = set()
    words = doc.split(" ")
    bigrams = zip(words, words[1:])
    for bigram in bigrams:
        if bigram not in cur_bigrams:  # Add bigram, but only once/doc
            cur_bigrams.add(bigram)
    # Each distinct bigram in this document adds 1 to the overall count
    for bigram in cur_bigrams:
        if bigram in bigram_freq:
            bigram_freq[bigram] += 1
        else:
            bigram_freq[bigram] = 1

# Build the result table, one row per bigram
result_df = pd.DataFrame(columns=["2_gram", "count"])
row_list = []
for bigram, freq in bigram_freq.items():
    row_list.append([bigram[0] + " " + bigram[1], freq])
for i in range(len(row_list)):
    result_df.loc[i] = row_list[i]

print(result_df)
``````

``````
           2_gram count
0          a book     3
1            is a     3
2         This is     3