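I'm building a chatbot database at the moment, using data from pushshift.io. To deal with the large data file (I understand that json loads everything into memory, so with only 16 GB of RAM and 30 GB of data that is a no-go), I wrote a bash script that splits the big file into smaller 3 GB chunk files so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code, it returns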
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the JSON files I had just created, and I see this happening in my temp file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correct sample of the data looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
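I noticed that my bash script split the file without paying attention to JSON object boundaries. So my question is: is there a way to write a function in Python that detects JSON objects that are not correctly formatted and removes them?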
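To make the question concrete, here is a minimal sketch of the kind of function I have in mind. It assumes the chunk files are newline-delimited JSON (one object per line, as the samples above suggest), so a record cut by the split shows up as a line that fails to parse. The names iter_valid_json_lines and clean_file are hypothetical, not part of any library.

```python
import json
from typing import Iterable, Iterator


def iter_valid_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed objects from newline-delimited JSON, skipping broken lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed line: drop it
        if isinstance(obj, dict):
            # Assumption: every record is a JSON object, so stray scalars are dropped too.
            yield obj


def clean_file(src_path: str, dst_path: str) -> int:
    """Copy only the well-formed JSON lines to dst_path; return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for obj in iter_valid_json_lines(src):
            dst.write(json.dumps(obj) + "\n")
            kept += 1
    return kept
```

Because the file is read one line at a time, only a single record is held in memory at once. Note that this drops the record that straddles each chunk boundary rather than repairing it.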
https://stackoverflow.com/questions/56068674