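When writing a Python crawler, we need to store the data we collect. For small amounts of data, a JSON file or a CSV file is enough.
Python works with JSON files through the json module. There are two processes: encoding and decoding.
Encoding converts a Python object into JSON text. The usual functions are dump() and dumps(). The difference is that dump() serializes the Python object and writes the resulting JSON text to a file object, while dumps() returns the JSON text as a string.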
# Python 3 signatures; the Python 2 versions additionally accepted encoding='utf-8'
dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True,
     allow_nan=True, cls=None, indent=None, separators=None,
     default=None, sort_keys=False, **kw)
dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True,
      allow_nan=True, cls=None, indent=None, separators=None,
      default=None, sort_keys=False, **kw)
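As a rough illustration of the difference between the two encoding functions, the sketch below serializes one record with dumps() and with dump(). The record is made up for the example, and an in-memory io.StringIO buffer stands in for a real file opened with open(); neither comes from the original article.

```python
import io
import json

# Hypothetical scraped record, used only for illustration.
record = {"ID": 1001, "Username": "yhw", "Country": "中国"}

# dumps() returns the JSON text as a str. With the default ensure_ascii=True,
# non-ASCII characters are written as \uXXXX escapes.
text_default = json.dumps(record)

# ensure_ascii=False keeps non-ASCII characters as they are;
# sort_keys=True orders the keys alphabetically.
text_utf8 = json.dumps(record, ensure_ascii=False, sort_keys=True)

# dump() writes the same kind of JSON text to any object with a write() method.
# The buffer stands in for open("data.json", "w", encoding="utf-8").
buffer = io.StringIO()
json.dump(record, buffer, ensure_ascii=False, indent=2)
written = buffer.getvalue()

print(text_default)
print(text_utf8)
print(written)
```

For crawled text that contains Chinese or other non-ASCII characters, ensure_ascii=False usually gives a more readable file, provided the file is opened with a suitable encoding such as UTF-8.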
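Decoding converts JSON text into a Python object. The usual functions are load() and loads(). The difference is that load() reads the JSON text from a file object, while loads() parses a string that already holds the JSON text. Both return the resulting Python object.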
# Python 3 signatures; the Python 2 versions additionally accepted encoding=None
load(fp, *, cls=None, object_hook=None, parse_float=None,
     parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
loads(s, *, cls=None, object_hook=None, parse_float=None,
      parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
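The decoding functions can be sketched the same way. The JSON text below is invented for the example, and io.StringIO again stands in for a file opened for reading.

```python
import io
import json

# Hypothetical JSON text, used only for illustration.
text = '{"ID": 1002, "Username": "Mary", "Tags": ["a", "b"], "Active": true, "Note": null}'

# loads() parses a str and returns the corresponding Python objects:
# JSON objects become dicts, arrays become lists, true/false/null
# become True/False/None.
obj = json.loads(text)

# load() does the same, but reads the text from a file-like object.
fp = io.StringIO(text)  # stands in for open("data.json", "r", encoding="utf-8")
obj_from_file = json.load(fp)

# object_pairs_hook receives each JSON object's (key, value) pairs in order.
pairs = json.loads('{"b": 1, "a": 2}', object_pairs_hook=list)

print(obj)
print(obj_from_file == obj)
print(pairs)
```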
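A CSV (Comma-Separated Values, sometimes also called character-separated values) file stores tabular data as plain text. It consists of any number of records separated by line breaks. Each record is made up of fields, and the fields are separated by some other character or string, most commonly a comma or a tab. A sample CSV file looks like this: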
ID,Username,Password,Age,Country
1001,yhw,123456,24,China
1002,Mary,654321,25,USA
1003,Jack,123567,22,English
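To make the record and field structure concrete, here is a minimal sketch that parses part of the sample above from memory with Python's standard csv module, and then parses a tab-separated copy of the same text. The variable names are chosen for this example only.

```python
import csv
import io

sample = (
    "ID,Username,Password,Age,Country\n"
    "1001,yhw,123456,24,China\n"
    "1002,Mary,654321,25,USA\n"
)

# Each line becomes one record; each record is a list of string fields.
rows = list(csv.reader(io.StringIO(sample)))

# The same data with tabs as the field separator, read via delimiter="\t".
tab_sample = sample.replace(",", "\t")
tab_rows = list(csv.reader(io.StringIO(tab_sample), delimiter="\t"))

print(rows)
print(tab_rows == rows)
```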
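Python reads and writes CSV files through the csv library. Writing a CSV file requires a writer object, which then writes the records row by row. The rows can be tuples written with csv.writer, or dictionaries written with csv.DictWriter: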
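import csv

header = ("ID", "Username", "Password", "Age", "Country")
rows = [(1001, "yhw", "123456", 24, "China"),
        (1002, "Mary", "654321", 25, "USA"),
        (1003, "Jack", "123567", 22, "English")]

# newline="" lets the csv module control line endings itself
with open("test3.csv", "w", newline="") as fp:
    fp_csv = csv.writer(fp)
    fp_csv.writerow(header)   # write the header row
    fp_csv.writerows(rows)    # write all records at once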
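# The same data as dictionaries, written with csv.DictWriter
import csv

header = ("ID", "Username", "Password", "Age", "Country")
rows = [{"ID": 1001, "Username": "yhw", "Password": "123456", "Age": 24, "Country": "China"},
        {"ID": 1002, "Username": "Mary", "Password": "654321", "Age": 25, "Country": "USA"},
        {"ID": 1003, "Username": "Jack", "Password": "123567", "Age": 22, "Country": "English"}]

with open("test3.csv", "w", newline="") as fp:
    fp_csv = csv.DictWriter(fp, header)  # header gives the column order
    fp_csv.writeheader()
    fp_csv.writerows(rows)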
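Crawled text often contains commas or quotation marks itself. The sketch below, with an invented comment field and an in-memory buffer instead of a file, suggests why going through the writer object is safer than joining strings by hand: with its default settings, csv.writer quotes such fields and doubles embedded quotes, so the data reads back unchanged.

```python
import csv
import io

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["ID", "Comment"])
# Hypothetical field containing both a comma and double quotes.
writer.writerow([1001, 'likes "Python", a lot'])

output = buffer.getvalue()

# Reading the text back recovers the original field (as a string).
parsed = list(csv.reader(io.StringIO(output)))

print(repr(output))
print(parsed)
```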
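Reading a CSV file requires a reader object, which likewise reads the records row by row. To get at a particular field, Python offers three ways of access: by index, by named tuple, or by dictionary.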
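import csv
from collections import namedtuple

with open("test3.csv", "r", newline="") as fp:
    # Iterate over the records (each row is a list of strings)
    fp_csv = csv.reader(fp)
    for row in fp_csv:
        print(row)

    # By index
    fp.seek(0)  # rewind, because the previous loop consumed the file
    fp_csv = csv.reader(fp)
    for row in fp_csv:
        # The original left the index empty; 1 (the Username column)
        # is used here purely for illustration.
        print(row[1])

    # By named tuple
    fp.seek(0)
    fp_csv = csv.reader(fp)
    header = next(fp_csv)            # the first row holds the field names
    Row = namedtuple("Row", header)
    for r in fp_csv:
        row = Row(*r)
        print(row.Username, row.Password)
        print(row)

    # By dictionary
    fp.seek(0)
    fp_csv = csv.DictReader(fp)      # uses the first row as keys
    for row in fp_csv:
        print(row.get("Username"))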
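One detail worth keeping in mind when reading: the csv module returns every field as a string, so numeric columns have to be converted explicitly. A minimal sketch, using an in-memory sample and column names borrowed from the example file (the averaging step is only there to show that the values are real numbers afterwards):

```python
import csv
import io

sample = "ID,Username,Age\n1001,yhw,24\n1002,Mary,25\n"

records = []
for row in csv.DictReader(io.StringIO(sample)):
    # Fields arrive as str; convert the numeric columns by hand.
    row["ID"] = int(row["ID"])
    row["Age"] = int(row["Age"])
    records.append(row)

average_age = sum(r["Age"] for r in records) / len(records)

print(records)
print(average_age)
```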