I am merging two CSV files. The inputs are as follows.
file1.csv
Id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"

file2.csv
Id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False

When I run this code:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="Id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)

I get this output:
Id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7.0,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5.0,False
4,True,2.0,Nope,,,
5,True,4.0,Tuesday,program,3.0,True
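6,False,1.0,Failure,,,

Notice that attr2 in some of my records has been converted from int to float. I get

1,True,7.0,Purple,,,

where I expected

1,True,7,Purple,,,

For this sample dataset this is a minor annoyance. However, when I run it against my much larger data, the same behavior also shows up in my Id column, which breaks processes further down my workflow chain.

How can I prevent pandas from doing this conversion, either for the whole file or, ideally, for specific columns?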
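Posted on 2014-04-28 18:54:43

You can pass a value to the dtype parameter of read_csv (a single type if you want to affect the whole frame, or a dictionary if you want to affect individual columns):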
>>> df = pd.read_csv("file1.csv", dtype={"id": int, "attr2": str})
>>> df
   id  attr1  attr2                                      attr3
0   1   True      7                                     Purple
1   2  False   19.8                                   Cucumber
2   3  False   -0.5  A string with a comma, because it has one
3   4   True      2                                       Nope
4   5   True    4.0                                    Tuesday
5   6  False      1                                    Failure
[6 rows x 4 columns]
>>> df.dtypes
id        int32
attr1      bool
attr2    object
attr3    object
dtype: object

https://stackoverflow.com/questions/23348883
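
To connect the dtype approach back to the merge in the question, here is a minimal sketch that reads both files with dtype=str, so every field stays as the text found in the CSV. It assumes the question's sample data (inlined through io.StringIO so the snippet is self-contained) and that no arithmetic is needed on the columns before writing. The helper name merge_as_text is made up for illustration.

```python
import io

import pandas as pd

# Sample inputs copied from the question, inlined so the sketch runs on its own.
FILE1 = '''Id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
'''

FILE2 = '''Id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
'''


def merge_as_text(left_csv, right_csv):
    """Hypothetical helper: outer-merge two CSVs on Id without numeric inference."""
    # dtype=str keeps each field as the text found in the file,
    # so "7" is not widened to 7.0 alongside "19.8".
    df1 = pd.read_csv(left_csv, dtype=str)
    df2 = pd.read_csv(right_csv, dtype=str)
    # Rows with no match get missing values; replace them with empty strings.
    return df1.merge(df2, on="Id", how="outer").fillna("")


merged = merge_as_text(io.StringIO(FILE1), io.StringIO(FILE2))
csv_text = merged.to_csv(index=False)
print(csv_text)
```

With this, the first data row is written as 1,True,7,Purple,,, rather than 1,True,7.0,Purple,,,. To restrict the effect to specific columns, pass a dictionary instead, for example dtype={"Id": str, "attr2": str}; the keys have to match the header spelling exactly. One trade-off to keep in mind: Id is now compared and sorted as text, so "10" sorts before "2".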