After tokenizing the sentences, I tried to use nltk's FreqDist to get the most common words from my data column (strings of text).
However, after mapping over the pandas DataFrame I get a column of lists instead of strings, so at runtime I get the error: TypeError: unhashable type: 'list'
df['Tokenized'] = df['Description'].map(tokenize)
word_dist = nltk.FreqDist(df['Tokenized'])  # TypeError: unhashable type: 'list'
My Tokenized column is now a column of lists. How can I fix this? Any help would be appreciated!
Posted on 2019-05-25 06:04:04
TL;DR
nltk.FreqDist takes a list of strings as input. You are feeding it a pandas Series of token lists, and because lists are unhashable they cannot be counted, which is what raises the TypeError.
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> df = pd.read_csv('x')
>>> df['Description']
0 Here is a sentence.
1 This is a foo bar sentence.
Name: Description, dtype: object
>>> df['Description'].map(word_tokenize)
0 [Here, is, a, sentence, .]
1 [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object
>>> sum(df['Description'].map(word_tokenize), [])
['Here', 'is', 'a', 'sentence', '.', 'This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
>>> FreqDist(sum(df['Description'].map(word_tokenize), []))
FreqDist({'a': 2, 'sentence': 2, '.': 2, 'is': 2, 'This': 1, 'foo': 1, 'bar': 1, 'Here': 1})
>>> type(df['Description'].map(word_tokenize))
<class 'pandas.core.series.Series'>
>>> type(sum(df['Description'].map(word_tokenize), []))
<class 'list'>
https://stackoverflow.com/questions/56282082