Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]

df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))
we get the error:
pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object')
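The same error occurs when using
df.to_parquet('players.pq')
Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using Parquet.write_table().
Using Python 3.8.0 with 0.13.0.
A solution for either pandas.DataFrame.to_parquet() or pq.write_table(pa.Table.from_pandas(pandas.DataFrame)) would be welcome. Thanks!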
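One way to get the pickle fallback by hand, as a minimal sketch: pickle each object to a bytes value before it goes into the DataFrame, so the column holds plain binary data instead of arbitrary Python objects. The snippet uses only the standard library and stops short of the pyarrow step. It assumes that a column of bytes values is then treated as an ordinary binary column. Only unpickle data that comes from a source you trust.

```python
import pickle


class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'


players = [Player('Jack', 21, 'm'), Player('Ryan', 18, 'm')]

# Serialize each object to bytes; these values would go into the 'player' column
pickled = [pickle.dumps(p) for p in players]

# After reading the column back, restore the original objects
restored = [pickle.loads(b) for b in pickled]
print(restored)
```

The trade-off is that the stored bytes are opaque: other tools reading the Parquet file cannot query name or age, and the Player class must be importable wherever the data is unpickled.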
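Posted on 2021-04-15 10:47:54
My suggestion is to insert the data into the DataFrame already serialized.
Best option: use a dataclass (Python >= 3.7)
Define the Player class as a dataclass with the decorator, so that serializing each instance (to JSON) needs no hand-written code.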
import json
import pandas as pd
from dataclasses import asdict, dataclass

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

dataV2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# json.dumps() cannot encode a dataclass instance directly;
# asdict() first converts each instance to a plain dict
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
print(df_v2)

# The object's attributes are still available after deserializing a record
json.loads(df_v2["player"][0])['name']
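A note on the dataclass route, shown as a small standard-library sketch: json.dumps() does not accept a dataclass instance as-is and raises TypeError, so the conversion to JSON has to go through dataclasses.asdict() (or a similar step). The `raised` flag below exists only to make that behavior visible.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str


jack = PlayerV2(name='Jack', age=21, gender='m')

# json.dumps() has no built-in encoding for dataclass instances
try:
    json.dumps(jack)
    raised = False
except TypeError:
    raised = True

# asdict() turns the instance into a plain dict, which json.dumps() can encode
encoded = json.dumps(asdict(jack))
decoded = json.loads(encoded)
print(raised, decoded['name'])
```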
Serialize the objects manually (Python < 3.7)
Define a serialization function in the Player class and serialize each instance before creating the DataFrame.
import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

    # The serialization function for JSON; if for some reason you really need pickle, you can use it instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]

df = pd.DataFrame(data, columns=['player'])

# All the data is stored as serialized JSON in the column 'player'
print(df)

# The object's attributes are still available after deserializing a record
json.loads(df["player"][0])['name']
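If full Player objects are needed again after reading the column back, the JSON string can be fed to the constructor. The sketch below adds a `fromJSON` classmethod, a hypothetical helper that is not part of the answer above. It assumes the JSON keys match the `__init__` parameter names exactly.

```python
import json


class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

    @classmethod
    def fromJSON(cls, text):
        # Hypothetical helper: JSON keys must match the __init__ parameter names
        return cls(**json.loads(text))


# Round trip: object -> JSON string (what the column stores) -> object
record = Player('Jane', 35, 'f').toJSON()
jane = Player.fromJSON(record)
print(jane.name, jane.age, jane.gender)
```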
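Posted on 2021-01-25 16:29:37
As I understand it, the problem is with the "type", because of repr. Try this approach (it works, since the column then holds plain strings):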
class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def other(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm').other(),
    Player('Ryan', 18, 'm').other(),
    Player('Jane', 35, 'f').other(),
]

df = pd.DataFrame(data, columns=['player'])
print(df)

        player
0  <Jack (21)>
1  <Ryan (18)>
2  <Jane (35)>

print(pa.Table.from_pandas(df))

pyarrow.Table
player: string
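Posted on 2022-03-25 14:23:31
Not sure whether this is a format Parquet supports, but it works for dict and list.
For a Python class, you can get the object's dictionary representation by calling object.__dict__.
For example, the following works: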
from dataclasses import dataclass
import pandas as pd
import pyarrow as pa

@dataclass
class Player:
    name: str
    age: int
    gender: str

players = [
    {"name": "player1", "age": 12, "gender": "f"},
    {"name": "player2", "age": 22, "gender": "m"},
    {"name": "player3", "age": 18, "gender": "m"}
]

df = pd.DataFrame()
df["players"] = [Player(**r).__dict__ for r in players]
pa.Table.from_pandas(df)
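One caveat with `__dict__`, sketched with the standard library only: it is a shallow view, so an attribute that is itself a custom object stays a custom object and would presumably hit the same conversion error. dataclasses.asdict() recurses into nested dataclasses instead. The `Team` class below is hypothetical and exists only to show the difference.

```python
from dataclasses import asdict, dataclass


@dataclass
class Player:
    name: str
    age: int
    gender: str


@dataclass
class Team:  # hypothetical class, only for illustration
    title: str
    captain: Player


team = Team(title='Reds', captain=Player(name='Jane', age=35, gender='f'))

# __dict__ is shallow: the nested Player is still a Player instance
shallow = team.__dict__
print(type(shallow['captain']).__name__)

# asdict() recurses, so the result contains only dicts and plain values
deep = asdict(team)
print(deep)
```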
https://stackoverflow.com/questions/59636745