
Delete duplicates in a 2D array in Python

Stack Overflow user
Asked on 2015-08-21 06:49:20
2 answers · 92 views · 0 votes

I'm trying to remove duplicates from a list of dictionaries, but only based on duplicates in the text values.

For example, I want to remove the duplicated content from this list of tweets:

{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': 'RT Iran deal opponents now have their "death panels" lie, and it\'s a whopper http://example.com/ntECOXorvK via @voxdotcom #IranDeal', 'id': 634068454207791104L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/65DRjWT6t8 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633631425279991812L}
{'text': 'RT : Iran deal quietly picks up some GOP backers via https://example.com/QD43vbJft6 catoletters: Iran deal quietly picks up some GOP backers \xe2\x80\xa6', 'id': 633495091584323584L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/pUG7vht0fE catoletters: Iran Deal's Surprising Supporters: http://example.com/dhdylTNgoG", 'id': 633083989180448768L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://example.com/sTBhL12llF", 'id': 632525323733729280L}
{'text': "RT : Iran Deal's Surprising Supporters - Today on the Liberty Report: https://example.com/PVHuVTyuAG RonPaul: Iran Deal'\xe2\x80\xa6 https://example.com/sTBhL12llF", 'id': 632385798277595137L}
{'text': "RT : Iran Deal's Surprising Supporters: https://example.com/hOUCmreHKA catoletters: Iran Deal's Surprising Supporters: http://example.com/bJSLhd9dqA", 'id': 632370745088323584L}
{'text': '#News #RT Iran deal debate devolves into clash over Jewish stereotypes and survival - W... http://example.com/foU0Sz6Jej http://example.com/WvcaNkMcu3', 'id': 631952088981868544L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}

To get the following:

{'text': 'Dear Conservatives: comprehend, if you can RT Iran deal opponents have their "death panels" lie, and it\'s a whopper http://example.com/EcSHCAm9Nn', 'id': 634092907243393024L}
{'text': '"@JeffersonObama: RT Iran deal support from Democratic senators is 19-1 so far....but...but Schumer...."', 'id': 631951056189149184L}

The answers I've found so far are mostly based on "plain" dictionaries, where the duplicate keys/values are identical. In my case it's a list of dictionaries: because of retweets, the text key is the same, but the corresponding tweet ids differ.
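For a list of dictionaries, the usual approach is to track the text values already seen in a set and keep only the first tweet for each one. A minimal sketch (the function name `dedupe_by_text` and the sample data are illustrative, not from the question):

```python
# Minimal sketch: keep the first tweet seen for each distinct 'text' value.
# `dedupe_by_text` and the sample data below are illustrative, not from the question.
def dedupe_by_text(tweets):
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet['text'] not in seen:
            seen.add(tweet['text'])
            unique.append(tweet)
    return unique

tweets = [
    {'text': 'RT Iran deal quietly picks up some GOP backers', 'id': 1},
    {'text': 'RT Iran deal quietly picks up some GOP backers', 'id': 2},
    {'text': 'a different tweet', 'id': 3},
]
print(dedupe_by_text(tweets))  # keeps ids 1 and 3
```

Note that exact matching will not catch near-duplicates whose texts differ only in the shortened URLs, as in the tweet list above; for those, the URL part would have to be stripped or normalized before comparison.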

Here is the full code; any tips on a more efficient way to write the tweets to a CSV file (making duplicate removal easier) are very welcome.

import csv
import codecs

from TwitterSearch import TwitterSearchOrder, TwitterUserOrder, TwitterSearchException, TwitterSearch

tweet_text_id = []

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(["Iran Deal"])
    tso.set_language('en')
    tso.set_include_entities(False)

    ts = TwitterSearch(
        consumer_key = "aaaaa",
        consumer_secret = "bbbbb",
        access_token = "cccc",
        access_token_secret = "dddd"
    )

    for tweet in ts.search_tweets_iterable(tso):
        tweet_text_id.append({'id': tweet['id'], 'text': tweet['text'].encode('utf8')})

    fieldnames = ['id', 'text']
    tweet_file = open('tweets.csv', 'wb')
    csvwriter = csv.DictWriter(tweet_file, delimiter=',', fieldnames=fieldnames)
    csvwriter.writerow(dict((fn, fn) for fn in fieldnames))
    for row in tweet_text_id:
        csvwriter.writerow(row)
    tweet_file.close()

except TwitterSearchException as e:
    print(e)
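One way to make the CSV step simpler is to fold the duplicate check into the writing loop itself, so duplicates never reach the file. A Python 3 sketch (the question's code is Python 2, hence `open(..., 'wb')` there; the `rows` sample data here is made up):

```python
import csv

# Sketch: skip rows whose 'text' has already been written (sample data is illustrative).
rows = [
    {'id': 1, 'text': 'hello'},
    {'id': 2, 'text': 'hello'},
    {'id': 3, 'text': 'world'},
]

seen = set()
with open('tweets.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'text'])
    writer.writeheader()
    for row in rows:
        if row['text'] in seen:
            continue  # duplicate text: do not write it again
        seen.add(row['text'])
        writer.writerow(row)
```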

Stack Overflow user
Answered on 2015-08-21 07:22:29

I made a module that filters out duplicate instances, removing hashtags in the process.

__all__ = ['filterDuplicates']
import re

hashRegex = re.compile(r'#[a-z0-9]+', re.IGNORECASE)
trunOne = re.compile(r'^\s+')
trunTwo = re.compile(r'\s+$')

def filterDuplicates(tweets):

    dupes = []
    new_dict = []

    for dic in tweets:
        new_txt = hashRegex.sub('', dic['text']) #Removes hashtags
        new_txt = trunOne.sub('', trunTwo.sub('', new_txt)) #Truncates extra spaces

        print(new_txt)

        dic.update({'text':new_txt})

        if new_txt in dupes:
            continue

        dupes.append(new_txt)
        new_dict.append(dic)

    return new_dict

if __name__ == '__main__':

    the_tweets = [
        {'text':'#yolo #swag something really annoying', 'id':1},
        {'text':'something really annoying', 'id':2},
        {'text':'thing thing thing haha', 'id':3},
        {'text':'#RF thing thing thing haha', 'id':4},
        {'text':'thing thing thing haha', 'id':5}
    ]

    #Tweets pre-filter
    for dic in the_tweets:
        print(dic)

    #Tweets post-filter
    for dic in filterDuplicates(the_tweets):
        print(dic)

Just import it into your script and run it to filter out the tweets!

Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/32129701