Question: Improving a Twitter parser for large datasets

Code Review user

Asked 2017-03-30 02:57:40

1 answer, 1.1K views, 0 followers, score 3

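I have the following fully working code that:

imports JSON files,
parses the tweets contained in the JSON,
records them in a table in a DataFrame.

Since I currently analyze 1,400 JSON files (about 1.5 GB), the code takes quite a long time to run. Please suggest whether there is a reasonable way to optimize the code to improve its speed. Thanks!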

import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

tweets = []

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            print(file)
            for line in open(file) :
                try:
                    tweet = json.loads(line)
                    tweets.append(tweet)
                except:
                    continue

tweet = tweets[0]

ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet] 
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]                    
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

df=pd.DataFrame({'Ids':pd.Index(ids),
                 'Text':pd.Index(text),
                 'Lang':pd.Index(lang),
                 'Geo':pd.Index(geo),
                 'Place':pd.Index(place)})
df

python

performance

parsing

json

twitter

1 Answer

Code Review user

Accepted answer

Answered 2017-03-30 07:03:41

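Just a few quick thoughts:

You have import os twice.
You do not use matplotlib or numpy, so those imports can go.
The line tweet = tweets[0] is useless.
You are not closing the files you open; you should use the with keyword.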
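To illustrate the last point, here is a minimal sketch of the file handling. The helper names iter_json_paths and count_lines are hypothetical and not part of the original code. It assumes the files are UTF-8 text. It also joins each file name with its directory, since os.walk yields bare file names that open() would otherwise resolve against the current working directory.

```python
import os


def iter_json_paths(root):
    """Yield the full path of every .json file found under root."""
    for dirpath, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.json'):
                # os.walk yields bare file names; join them with the directory
                yield os.path.join(dirpath, name)


def count_lines(path):
    """Count the lines in a file; the with block closes it on exit."""
    with open(path, 'r', encoding='utf-8') as handle:
        return sum(1 for _line in handle)
```

The with block closes the file even if an exception is raised while reading it, which matters when walking through more than a thousand files.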

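Two optimizations:

I would remove print(file). That is probably one of the best optimizations you can make.
You already loop over the tweets once, so why loop five more times?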

How about something like this (not tested!):

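import json
import os
from collections import defaultdict

import pandas as pd

# Tweet fields to extract; the tweet ID is stored under 'id_str'
elements_keys = ['id_str', 'text', 'lang', 'geo', 'place']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            # os.walk yields bare file names, so join them with the directory
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                    except ValueError:
                        continue  # skip lines that are not valid JSON
                    if not isinstance(tweet, dict):
                        continue  # skip JSON values that are not objects
                    # .get() appends None for a missing key, so all
                    # columns stay the same length
                    for key in elements_keys:
                        elements[key].append(tweet.get(key))

df = pd.DataFrame({'Ids': pd.Index(elements['id_str']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo']),
                   'Place': pd.Index(elements['place'])})
df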
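The single-pass idea can also be sketched as a small function that is easy to test on its own. This is only an illustration: the names FIELDS and extract_rows are hypothetical, and filling missing fields with None (rather than skipping the tweet) is an assumption about the desired behavior.

```python
import json

# Tweet fields to keep; these match the columns built in the question.
FIELDS = ('id_str', 'text', 'lang', 'geo', 'place')


def extract_rows(lines, fields=FIELDS):
    """Parse JSON lines once and return one dict per tweet."""
    rows = []
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # not valid JSON, skip the line
        if not isinstance(tweet, dict):
            continue  # valid JSON, but not an object
        # .get() yields None for a missing field, so every row has all keys
        rows.append({field: tweet.get(field) for field in fields})
    return rows
```

Because only the five fields are kept, the full tweet objects are not held in memory. The resulting list of dicts can be passed to pd.DataFrame(rows), which builds one column per key.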

Score: 1

Original content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/159303
