Q: Splitting pandas DataFrame column values

Stack Overflow user

Asked 2017-08-24 19:11:41

Answers: 1 · Views: 720 · Followers: 0 · Votes: 1

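I have an Excel dataset containing user type, ID, and property description. I imported this file into a pandas DataFrame (df) in Python.

Now I want to split the content into one-word, two-word, and three-word tokens. I can do the one-word tokenization with the help of the NLTK library, but I am stuck on the two-word and three-word tokens. For example, one row of the Description column contains the sentence:

A brand new residential apartment on the main road in Mumbai, with portable water.

I want to split this sentence into

"A brand", "brand new", "new residential", "residential apartment", ... "portable water".

This split should be applied to every row of that column.

[Image: the dataset shown in Excel format]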

python-3.x

pandas

nltk

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-08-24 19:42:10

Here is a small example that uses ngrams from nltk. Hope it helps:

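import pandas as pd
from nltk.util import ngrams
from nltk import word_tokenize
# Note: word_tokenize relies on NLTK's Punkt tokenizer data (installed via nltk.download())

# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)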

Input DataFrame:

    text
0   my first sentence
1   this is the second sentence
2   third sent of the dataframe

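Now we can use ngrams together with word_tokenize to build bigrams and trigrams and apply it to every row of the DataFrame. For bigrams, we pass the tokenized words and the value 2 to the ngrams function; for trigrams, we pass 3. ngrams returns a generator, so we convert the result to a list. For each row, the bigram and trigram lists are stored in separate columns.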

df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)

Result:

                     text  \
0            my first sentence   
1  this is the second sentence   
2  third sent of the dataframe   

                                                   bigram  \
0                            [(my, first), (first, sentence)]   
1  [(this, is), (is, the), (the, second), (second, sentence)]   
2    [(third, sent), (sent, of), (of, the), (the, dataframe)]   

                                                     trigram  
0                                        [(my, first, sentence)]  
1  [(this, is, the), (is, the, second), (the, second, sentence)]  
2     [(third, sent, of), (sent, of, the), (of, the, dataframe)]
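The columns above hold tuples, while the question asks for phrases such as "A brand" and "brand new". A minimal sketch of one way to get space-joined strings is shown below. It is not part of the original answer: the helper name word_ngrams is hypothetical, the sample sentence is a shortened version of the one in the question, and it uses a plain whitespace split instead of word_tokenize, so punctuation stays attached to the neighbouring word.

```python
def word_ngrams(text, n):
    """Return the n-word phrases of `text` as space-joined strings.

    Hypothetical helper for illustration; it splits on whitespace only,
    which is an assumption and differs from nltk's word_tokenize.
    """
    words = text.split()
    # Zipping n staggered views of the word list yields consecutive n-tuples;
    # zip stops at the shortest view, so short inputs simply give [].
    staggered = (words[i:] for i in range(n))
    return [" ".join(gram) for gram in zip(*staggered)]


sentence = "A brand new residential apartment"
print(word_ngrams(sentence, 2))
# ['A brand', 'brand new', 'new residential', 'residential apartment']
print(word_ngrams(sentence, 3))
# ['A brand new', 'brand new residential', 'new residential apartment']
```

Assuming the column is named Description as in the question, the same helper could be applied per row with df['Description'].apply(lambda s: word_ngrams(s, 2)). The tuples from nltk's ngrams can be joined the same way with " ".join(gram).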

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/45869287
