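With low_memory=True (the default), pandas reads the file in chunks of rows and then appends them together. Some columns can then end up as a mix of integer chunks and string chunks, depending on whether pandas hit anything in a given chunk that could not be cast to integer (say). This can cause problems later. The warning tells you that this happened at least once during the read, so you should be careful. Setting low_memory=False uses more memory but avoids the problem.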
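As a minimal sketch of the two ways out that the warning itself suggests, the snippet below reads a tiny in-memory CSV once with an explicit dtype and once with low_memory=False. The sample data and the variable names (csv_text, explicit, whole) are made up for illustration, and the file is far too small to trigger chunked reading; it only shows what the options look like and that each yields one consistent Python type in the column.

```python
import io

import pandas as pd

# Hypothetical sample data: a mostly numeric id column with one non-numeric value.
csv_text = "user_id,username\n1,Alice\n2,Bob\nfoobar,Caesar\n"

# Option 1: state the dtype up front, so pandas never has to guess per chunk.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"user_id": str})
explicit_types = {type(v) for v in explicit["user_id"]}

# Option 2: low_memory=False parses each column in one pass, so type
# inference sees the whole column at once.
whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)
whole_types = {type(v) for v in whole["user_id"]}

print(explicit_types, whole_types)
```

Stating the dtype is usually the more deliberate choice when you already know a column holds identifiers rather than numbers.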

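Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, so convenience matters more to me than efficiency.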

The following code illustrates a case where low_memory=True is set and a column comes in with mixed types. It builds on the answer by @firelynx.

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

# make a big csv data file, following earlier approach by @firelynx
csvdata = """1,Alice
2,Bob
3,Caesar
"""

# we have to replicate the "integer column" user_id many many times to get
# pd.read_csv to actually chunk read. otherwise it just reads 
# the whole thing in one chunk, because it's faster, and we don't get any 
# "mixed dtype" issue. the 100000 below was chosen by experimentation.
# header + 100000 copies of the three rows + one non-numeric row at the end
csvdatafull = "user_id,username\n" + csvdata * 100000 + "foobar,Cthlulu\n"

sio = StringIO(csvdatafull)
# the following line gives me the warning:
    # C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
    # interactivity=interactivity, compiler=compiler, result=result)
# but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
x = pd.read_csv(sio, low_memory=True) #, dtype={"user_id": int, "username": "string"})

x.dtypes
# this gives:
# Out[69]: 
# user_id     object
# username    object
# dtype: object

type(x['user_id'].iloc[0]) # int
type(x['user_id'].iloc[1]) # int
type(x['user_id'].iloc[2]) # int
type(x['user_id'].iloc[10000]) # int
type(x['user_id'].iloc[299999]) # str !!!! (even though it's a number! so this chunk must have been read in as strings)
type(x['user_id'].iloc[300000]) # str !!!!!
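Rather than probing individual rows, one way to check a whole column is to count the Python types it contains. This is only a sketch on a small hand-built Series (the name col and its values are illustrative, not taken from the file above), but the same two lines should work on a column such as x['user_id'].

```python
import pandas as pd

# Hypothetical object column holding both ints and strings,
# similar to what a chunked read can produce.
col = pd.Series([1, 2, "3", "foobar"], dtype=object)

# Count how many values of each Python type the column holds.
type_counts = {t.__name__: int(n) for t, n in col.map(type).value_counts().items()}

# More than one distinct type means the column is mixed.
is_mixed = col.map(type).nunique() > 1

print(type_counts, is_mixed)
```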

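Aside: to give an example where this is a problem (and where I first ran into it as a serious issue), imagine you ran pd.read_csv() on a file and then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric and sometimes a string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.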

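With low_memory=True, pandas might read the identifier column in with mixed types: just because of the chunking, the identifier 81287 is sometimes a number and sometimes a string. When I try to drop duplicates based on this, well: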

81287 == "81287"
Out[98]: False
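To make the deduplication problem concrete, here is a small sketch with a hand-built identifier column (the values are illustrative, reusing the identifiers mentioned above). It shows drop_duplicates treating the integer and the string form of 81287 as different values, and one possible workaround: casting the column to str first. Reading the column with an explicit string dtype in the first place would avoid the cast.

```python
import pandas as pd

# Hypothetical identifier column as a chunked read might leave it:
# the same id appears once as an int and once as a str.
ids = pd.Series([81287, "81287", "97324-32", 81287], dtype=object)

# The int 81287 and the str "81287" are kept as two distinct values.
mixed_unique = ids.drop_duplicates().tolist()

# Normalizing to str first collapses them into one identifier.
normalized_unique = ids.astype(str).drop_duplicates().tolist()

print(mixed_unique)
print(normalized_unique)
```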


Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/24251219
