来自TMDB csv文件的片段:
movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}, {""cast_id"": 5, ""character"": ""Trudy Chacon"", ""credit_id"": ""52fe48009251416c750ac9d3"", ""gender"": 1, ""id"": 17647, ""name"": ""Michelle Rodriguez"", ""order"": 4}, {""cast_id"": 8, ""character"": ""Selfridge"", ""credit_id"": ""52fe48009251416c750ac9e1"", ""gender"": 2, ""id"": 1771, ""name"": ""Giovanni Ribisi"", ""order"": 5}代码:
tmdb_credit_df = pd.read_csv('tmdb.csv')
tmdb_credit_df['cast'] = tmdb_credit_df['cast'].apply(eval)cast列中的每个单元格都包含一个dicts列表。示例:
[{'cast_id': 242,
'character': 'Jake Sully',
'credit_id': '5602a8a7c3a3685532001c9a',
'gender': 2,
'id': 65731,
'name': 'Sam Worthington',
'order': 0},
{'cast_id': 3,
'character': 'Neytiri',
'credit_id': '52fe48009251416c750ac9cb',
'gender': 1,
'id': 8691,
'name': 'Zoe Saldana',
'order': 1}, ...]我试图将数据拉平,使其看起来像:
movie_id title cast_id character ...
0 19995 Avatar 242 Jake Sully ...
1 19995 Avatar 3 Neytiri ...是否有一种方法可以使用json_normalize()或.apply()来扁平/解压这个表,而不是循环遍历每一行?
我试图使用json_normalize(tmdb_credit_df.cast),但是我得到了一个错误:
'list' object has no attribute 'values'我还尝试tmdb_credit_df.cast.apply(lambda x: x[0])一次提取一个字段,但是我得到了以下错误:
list index out of range
发布于 2018-01-09 23:25:14
从-
df
movie_id title cast
0 19995 Avatar [{"cast_id": 242, "character": "Jake Sully", "...在这里,cast是一列字符串。
cast将cast列转换为python对象的列。df将to_dict转换为dictjson_normalize -
前两个步骤如下所示,使用apply(pd.io.json.loads) + to_dict -
d = df.assign(cast=df.cast.apply(pd.io.json.loads)).to_dict('r') 接下来,用meta和record_path参数调用record_path-
df = pd.io.json.json_normalize(d, meta=['movie_id', 'title'], record_path=['cast'])
df
cast_id character credit_id gender id \
0 242 Jake Sully 5602a8a7c3a3685532001c9a 2 65731
1 3 Neytiri 52fe48009251416c750ac9cb 1 8691
2 25 Dr. Grace Augustine 52fe48009251416c750aca39 1 10205
3 4 Col. Quaritch 52fe48009251416c750ac9cf 2 32747
4 5 Trudy Chacon 52fe48009251416c750ac9d3 1 17647
5 8 Selfridge 52fe48009251416c750ac9e1 2 1771
name order title movie_id
0 Sam Worthington 0 Avatar 19995
1 Zoe Saldana 1 Avatar 19995
2 Sigourney Weaver 2 Avatar 19995
3 Stephen Lang 3 Avatar 19995
4 Michelle Rodriguez 4 Avatar 19995
5 Giovanni Ribisi 5 Avatar 19995 发布于 2018-01-09 23:49:48
这在csv上循环,并将json (cast_str)转换为Python of dicts (cast),将其传递给Pandas的DataFrame构造函数(cast_df),将结果的DataFrame附加到list (frames)中,然后最后将frames的内容组合成一个大的DataFrame (df)。
import csv
import json
import pandas as pd
path = "/path/to/file/tmdb.csv"
frames = list()
reader = csv.reader(open(path))
next(reader) # skip the header row in csv
for movie_id_str, title, cast_str, _ in reader:
cast = json.loads(cast_str)
cast_df = pd.DataFrame(cast)
cast_df['movie_id'] = int(movie_id)
cast_df['title'] = title
frames.append(cast_df)
df = pd.concat(frames, ignore_index=True)注意,为了使测试数据能够正确地解析,我确实需要将]", "[]"添加到末尾来更改测试数据。
https://stackoverflow.com/questions/48177934
复制相似问题