How to filter data in a list using tuples

Stack Overflow user
Asked 2020-05-24 06:13:35
2 answers, 202 views, 0 followers, 1 vote

POS tag filtering

# Dummy data

text = (
    "Sukanya is getting married next year. "
    "Marriage is a big step in one’s life. "
    "It is both exciting and frightening. "
    "But friendship is a sacred bond between people. "
    "It is a special kind of love between us. "
    "Many of you must have tried searching for a friend "
    "but never found the right one."
)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Requires the NLTK data for stop words, sentence tokenization and POS tagging
stop_words = set(stopwords.words('english'))

def get_pos_tags(text):
    tagged = []
    for sentence in sent_tokenize(text):
        # Split the sentence into words and punctuation
        words = word_tokenize(sentence)

        # Remove stop words
        words = [w for w in words if w not in stop_words]

        # Tag the remaining words with their part of speech and
        # keep one list of (word, tag) tuples per sentence
        tagged.append(nltk.pos_tag(words))

    return tagged

# df is a pandas DataFrame with a "text" column
df["tagged"] = df["text"].apply(get_pos_tags)

I have a DataFrame (df). Each row is a list that contains tuples.

Example row:

[[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')],
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')],
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')], 
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')], 
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')], 
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never','RB'),
 ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]]

Now I am trying to filter the adjective, noun, verb, and adverb POS tags into a separate filtered_tags column.

def filter_pos_tags(tagged_text):
    filtered_tags = []
    for sentence in tagged_text:
        for word, tag in sentence:
            # Keep adjectives (J*), verbs (V*), nouns (N*) and adverbs (R*)
            if tag.startswith(("J", "V", "N", "R")):
                filtered_tags.append(word)
    return filtered_tags

df["filtered_tags"] = df["tagged"].apply(filter_pos_tags)

The output I get:

['Sukanya', 'getting', 'married', 'next', 'year', 'Marriage', 'big', 'step', 'life', 'exciting', 'frightening', 'friendship', 'sacred', 'bond', 'people', 'special', 'kind', 'love', 'Many', 'tried', 'searching', 'friend', 'found', 'right']

Desired output:

[['Sukanya', 'getting', 'married', 'next', 'year'], ['Marriage', 'big', 'step', 'life'], ['exciting', 'frightening'], ['friendship', 'sacred', 'bond', 'people'], ['special', 'kind', 'love'], ['Many', 'tried', 'searching', 'friend'], ['found', 'right']]

2 Answers

Stack Overflow user

Answered 2020-05-24 06:46:42

Try this:

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

text = """Sukanya is getting married next year.
Marriage is a big step in one's life.
It is both exciting and frightening.
But friendship is a sacred bond between people.
It is a special kind of love between us.
Many of you must have tried searching for a friend
but never found the right one."""

stop_words = set(stopwords.words('english'))

def get_pos_tags(text):
    tagged = []
    for sentence in sent_tokenize(text):
        # Split the sentence into words and punctuation
        words = word_tokenize(sentence)
        # Remove stop words
        words = [w for w in words if w not in stop_words]
        # Tag with the coarse universal tagset (NOUN, VERB, ADJ, ADV, ...)
        tagged.extend(nltk.pos_tag(words, tagset='universal'))
    return tagged

def get_filtered(tagged_text):
    valid_tags = {'ADJ', 'NOUN', 'VERB', 'ADV'}
    # Keep only the words whose tag is one of the valid tags
    return [word for word, tag in tagged_text if tag in valid_tags]

# One DataFrame row per line of the text
df = pd.DataFrame({
    'text': text.split("\n")
})
df["tagged"] = df["text"].apply(get_pos_tags)
df['filtered'] = df['tagged'].apply(get_filtered)
print(df['filtered'])

The output is as follows:

0    [Sukanya, getting, married, next, year]
1                [Marriage, big, step, life]
2                    [exciting, frightening]
3         [friendship, sacred, bond, people]
4                      [special, kind, love]
5     [Many, must, tried, searching, friend]
6                      [never, found, right]
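
This output includes 'must', which the question's prefix test would drop. A minimal sketch of why the two filters can disagree, assuming hand-written tag triples rather than live tagger output (the names SAMPLE, PTB_PREFIXES and UNIVERSAL_TAGS are only for illustration). In NLTK's mapping from Penn Treebank tags to the universal tagset, the modal tag MD becomes VERB, so a set test on universal tags keeps modals while a prefix test on "J", "V", "N", "R" does not:

```python
# Hand-written (word, Penn Treebank tag, universal tag) triples for illustration;
# they are not produced by running the tagger here.
SAMPLE = [
    ("must", "MD", "VERB"),
    ("tried", "VB", "VERB"),
    ("never", "RB", "ADV"),
    ("one", "CD", "NUM"),
]

PTB_PREFIXES = ("J", "V", "N", "R")
UNIVERSAL_TAGS = {"ADJ", "NOUN", "VERB", "ADV"}

# Prefix test on Penn Treebank tags, as in the question
by_prefix = [word for word, ptb, _ in SAMPLE if ptb.startswith(PTB_PREFIXES)]

# Set membership test on universal tags, as in this answer
by_universal = [word for word, _, uni in SAMPLE if uni in UNIVERSAL_TAGS]

print(by_prefix)     # ['tried', 'never']
print(by_universal)  # ['must', 'tried', 'never']
```

If modals should be excluded, one option is to filter on the Penn Treebank tags instead of the universal ones.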
Votes: 2

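Stack Overflow user

Answered 2020-05-24 06:31:54

If you change your function to append a new list to filtered_tags for each item, you will get the result you expect.

Using the filter_pos_tags() function below in place of yours should work for you.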

def filter_pos_tags(tagged_text):
    filtered_tags = []
    for index, i in enumerate(tagged_text):
        # Start a new sub-list for each sentence
        filtered_tags.append([])
        for j in i:
            if j[-1].startswith(("J", "V", "N", "R")):
                filtered_tags[index].append(j[0])
    return filtered_tags
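
The same per-sentence grouping can also be sketched as a nested list comprehension. This is only an alternative spelling of the loop above, not a different technique; the name filter_pos_tags_nested and the shortened sample row (three sentences taken from the question's example row) are used here for illustration:

```python
def filter_pos_tags_nested(tagged_text):
    # One inner list per sentence; keep words whose tag starts with J, V, N or R
    return [
        [word for word, tag in sentence if tag.startswith(("J", "V", "N", "R"))]
        for sentence in tagged_text
    ]

# Three sentences from the example row in the question
row = [
    [('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')],
    [('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')],
    [('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')],
]

result = filter_pos_tags_nested(row)
print(result)
# [['Sukanya', 'getting', 'married', 'next', 'year'],
#  ['exciting', 'frightening'],
#  ['friendship', 'sacred', 'bond', 'people']]
```

With a DataFrame, it would be applied the same way as before, for example df["tagged"].apply(filter_pos_tags_nested).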

Note:

The example row you provided has only 6 elements, whereas the dummy data appears to contain 7 sentences.
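
One possible reading of that mismatch is that the dummy text spans 7 lines but only 6 sentences, because the last sentence wraps onto a second line. A rough sketch of the count, assuming a naive regular-expression split on whitespace after a period as a stand-in for sent_tokenize (so it runs without NLTK data):

```python
import re

text = """Sukanya is getting married next year.
Marriage is a big step in one's life.
It is both exciting and frightening.
But friendship is a sacred bond between people.
It is a special kind of love between us.
Many of you must have tried searching for a friend
but never found the right one."""

# Splitting on newlines treats the wrapped last sentence as two items
lines = text.split("\n")

# Naive sentence split: break on whitespace that follows a period
sentences = re.split(r"(?<=\.)\s+", text)

print(len(lines))      # 7
print(len(sentences))  # 6
```

So a row built per sentence would have 6 sub-lists, while splitting the text per line gives 7 groups, as in the desired output.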

Votes: 0
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61982286
