Q: FreqDist for the most common words or phrases

Stack Overflow user

Asked 2019-05-24 01:17:43

Answers: 1 · Views: 407 · Followers: 0 · Votes: 2

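I'm trying to analyze some data from app reviews.

I want to use nltk's FreqDist to see the most frequently occurring phrases in a file. A phrase can be a single token or a key phrase. I don't want to tokenize the data, because that would only give me the most frequent tokens. But right now, FreqDist treats each review as one string instead of extracting the words within each review.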

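import string

import nltk
import pandas as pd

df = pd.read_csv('Positive.csv')

def pre_process(text):
    # Lower-case, trim, flatten newlines, drop curly apostrophes and ASCII punctuation
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower().strip().replace("\n", " ").replace("’", "").translate(translator)
    return text

df['Description'] = df['Description'].map(pre_process)
df = df[df['Description'] != '']

# Each element of the Series is a whole review string, so whole reviews are counted
word_dist = nltk.FreqDist(df['Description'])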

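('Description' is the body/message of the review.)

For example, I'd like to get the most frequent terms, such as "I like", "useful", "very good app", but what I get instead as the most frequent term is "I really like this app because bablabla" (the entire review).

That's why I get this result when I plot the FreqDist (the plot image is not included here).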

Tags: tokenize, python, machine-learning, nlp, nltk

1 Answer

Stack Overflow user

Answered 2019-05-25 06:10:53

TL;DR

Use ngrams or everygrams
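To make the difference between the two helpers concrete, here is a minimal sketch. It assumes the tokens are produced with a plain `str.split()` on a made-up sentence (not data from the question): `ngrams` yields only sequences of exactly `n` tokens, while `everygrams` yields every length from `min_len` to `max_len`.

```python
from nltk.util import everygrams, ngrams

# Hypothetical example sentence, already free of punctuation
tokens = "this is a foo bar sentence".split()

# ngrams: only sequences of exactly n tokens (here n=2)
bigrams = list(ngrams(tokens, 2))
print(bigrams)

# everygrams: all sequences from min_len to max_len tokens (here 1 to 2),
# i.e. the unigrams plus the bigrams
one_to_two = list(everygrams(tokens, 1, 2))
print(len(bigrams), len(one_to_two))
```

If only two-word phrases matter, `ngrams(tokens, 2)` is enough; `everygrams` is the more convenient choice when single words and short phrases should be ranked together.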

>>> from itertools import chain
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> from nltk.util import everygrams

>>> df = pd.read_csv('x')
>>> df['Description']
0            Here is a sentence.
1    This is a foo bar sentence.
Name: Description, dtype: object

>>> df['Description'].map(word_tokenize)
0              [Here, is, a, sentence, .]
1    [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object

>>> sents = df['Description'].map(word_tokenize).tolist()

>>> FreqDist(list(chain(*[everygrams(sent, 1, 3) for sent in sents])))
FreqDist({('sentence',): 2, ('is', 'a'): 2, ('sentence', '.'): 2, ('is',): 2, ('.',): 2, ('a',): 2, ('Here', 'is', 'a'): 1, ('a', 'foo'): 1, ('a', 'sentence'): 1, ('bar', 'sentence', '.'): 1, ...})
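Applied to the question's setup, the same idea might look like the sketch below. The `reviews` list is hypothetical stand-in data for the pre-processed `df['Description']` column, and `str.split()` is used instead of `word_tokenize` on the assumption that punctuation has already been stripped. `FreqDist` counts whatever items the iterable yields, so feeding it n-gram tuples rather than whole review strings produces phrase counts.

```python
from itertools import chain

from nltk import FreqDist
from nltk.util import everygrams

# Hypothetical reviews, already lower-cased and stripped of punctuation
# (stand-in for df['Description'].tolist() after pre_process)
reviews = [
    "i love this app",
    "i love the design",
    "very good app",
]

# Split each review into tokens; str.split() suffices once punctuation is gone
token_lists = [review.split() for review in reviews]

# Count every 1- to 3-gram, generated per review so no phrase spans two reviews
phrase_dist = FreqDist(
    chain.from_iterable(everygrams(tokens, 1, 3) for tokens in token_lists)
)

# Join the tuples back into readable phrases for display
for gram, count in phrase_dist.most_common(5):
    print(" ".join(gram), count)
```

Because unigrams occur at least as often as any phrase containing them, it can help to filter the result by tuple length (for example, keeping only grams with `len(gram) >= 2`) before plotting.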

Votes: 1

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/56280140