I'm currently tracking a large-scale digitization of videotapes and need help pulling data from multiple CSVs. Most of the tapes have multiple copies, but we only digitize one copy of each tape. I want to create a new CSV containing only the shows on tape that have not yet been digitized. Here's a mockup of my original CSV:
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS

Here I want to ignore every row containing "Series A" and "101", since that show has a value in the "Date Digitized" cell. I've tried to isolate these conditions but can't seem to get a complete list of the non-digitized content. Here's my code:
import csv, glob, os

names = glob.glob("*.csv")
names = [os.path.splitext(each)[0] for each in names]
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        with open("%s_edit.csv" % name, "wb") as result:
            writer = csv.writer(result)
            for row in reader:
                if row[0]:
                    series = row[1]
                    epnum = row[2]
                if row[1] != series and row[2] != epnum:
                    writer.writerow(row)

I should add that this is my first question and I'm very new to Python, so any advice would be much appreciated!
Posted on 2016-09-16 16:13:54
The simplest approach is to make two passes over the set of CSV files: one to build a list of all digitized tapes, and a second to build a unique list of all tapes that are not on the digitized list:
# build list of digitized tapes
digitized = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader)  # skip header
        for row in reader:
            if row[0] and ((row[1], row[2]) not in digitized):
                digitized.append((row[1], row[2]))

# build list of non-digitized tapes
digitize_me = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        header = next(reader)[1:3]  # skip / save header
        for row in reader:
            if not row[0] and ((row[1], row[2]) not in digitized + digitize_me):
                digitize_me.append((row[1], row[2]))

# write non-digitized tapes to 'digitize.csv'
with open("digitize.csv", "wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for tape in digitize_me:
        writer.writerow(tape)

Input file 1:
Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic

Input file 2:
Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS

Output:
Series,Episode Number
Series D,102
Series B,101
Series D,101

Per the OP's comment, the line
header = next(reader)[1:3]  # skip / save header

serves two purposes:
The csv files all begin with a header, and we don't want to read that header row as if it contained data about a tape, so in that sense we need to "skip" it. At the same time, we want the output csv file to have a header too, so we save the relevant part of the header as we read. Since we only write out the series and episode number, which are row fields 1 and 2, we assign only that slice of the header row (i.e. [1:3]) to the header variable. Having a single line of code serve two fairly unrelated purposes like this isn't really standard practice, which is why I commented it. It also assigns to header multiple times (assuming there is more than one input file), when header only needs to be assigned once. Perhaps a cleaner way to write that section is:
# build list of non-digitized tapes
digitize_me = []
header = None
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        if header:
            next(reader)  # skip header
        else:
            header = next(reader)[1:3]  # read header
        for row in reader:
            ...

It's a question of which form is more readable. It's a close call either way, but I think folding five lines of code into one keeps the focus on the more salient parts of the code. Next time I might well do it the other way.
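A side note beyond the answer above: the membership test `(row[1], row[2]) not in digitized + digitize_me` rebuilds a combined list and scans it linearly for every row, which gets slow once the collection is large. A Python 3 sketch of the same two-pass idea using sets for constant-time lookups (the function names and file handling here are my own, not from the answer):

```python
import csv

def find_non_digitized(paths):
    """Return unique (series, episode) pairs that have no digitized copy."""
    # pass 1: keys of every tape that already has a digitized copy
    digitized = set()
    for path in paths:
        with open(path, newline="") as source:
            reader = csv.reader(source)
            next(reader)  # skip header
            for row in reader:
                if row[0]:  # "Date Digitized" is non-empty
                    digitized.add((row[1], row[2]))

    # pass 2: unique non-digitized keys, in first-seen order
    digitize_me = []
    seen = set(digitized)
    for path in paths:
        with open(path, newline="") as source:
            reader = csv.reader(source)
            next(reader)  # skip header
            for row in reader:
                key = (row[1], row[2])
                if not row[0] and key not in seen:
                    seen.add(key)
                    digitize_me.append(key)
    return digitize_me

def write_report(pairs, out_path="digitize.csv"):
    """Write the (series, episode) pairs to a CSV with a header row."""
    with open(out_path, "w", newline="") as result:
        writer = csv.writer(result)
        writer.writerow(["Series", "Episode Number"])
        writer.writerows(pairs)
```

With the two sample input files above, `find_non_digitized` yields the same three pairs as the original answer, and `newline=""` is the Python 3 replacement for the `"rb"`/`"wb"` modes used in the Python 2 code.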
Posted on 2016-09-16 17:13:59
I'm not 100% sure I've understood what you need, but this may put you on the right track. I'm using the pandas module:
data = """
Date Digitized | Series | Episode Number | Title | Format
---------------|----------|----------------|-------|--------
01-01-2016 | Series A | 101 | | VHS
| Series A | 101 | | Beta
| Series A | 101 | | U-Matic
| Series B | 101 | | VHS"""
# useful module for treating csv files (and many other)
import pandas as pd
# module to handle data as it was a csv file
import io
# read the csv into pandas DataFrame
# use the 0 row as a header
# fields are separated by |
df = pd.read_csv(
    io.StringIO(data),
    header=0,
    sep="|"
)
# there is a bit of a problem with white space
# remove white space from the column names
df.columns = [x.strip() for x in df.columns]
# remove white space from all string fields
df = df.applymap(lambda x: x.strip() if type(x) == str else x)
# finally choose the subset we want
# for some reason pandas guessed the type of Episode Number wrong
# it should be integer, this probably won't be a problem when loading
# directly from file
df = df[~((df["Series"] == "Series A") & (df["Episode Number"] == "101"))]
# print the result
print(df)
# Date Digitized Series Episode Number Title Format
# 0 --------------- ---------- ---------------- ------- --------
# 4                  Series B              101                VHS

Feel free to ask questions — hopefully I can change the code to fit your actual needs, or help in any other way.
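Building on the same idea (my own sketch, not part of the answer above): with pandas the whole multi-file task — dropping every (Series, Episode Number) group that has at least one digitized copy — can be expressed with `groupby` and `transform`, assuming the real files are comma-separated and leave "Date Digitized" empty for non-digitized rows. The function name `non_digitized` is a placeholder of my own:

```python
import pandas as pd

def non_digitized(paths):
    """Concatenate the CSVs and return one row per tape with no digitized copy."""
    df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    # a (Series, Episode Number) group counts as digitized if ANY of its
    # rows has a value in "Date Digitized"
    digitized = (
        df.groupby(["Series", "Episode Number"])["Date Digitized"]
          .transform(lambda s: s.notna().any())
    )
    out = df.loc[~digitized, ["Series", "Episode Number"]]
    return out.drop_duplicates().reset_index(drop=True)

# usage, e.g.:
#   import glob
#   non_digitized(glob.glob("*.csv")).to_csv("digitize.csv", index=False)
```

On the two sample input files above this produces the same three rows as the accepted two-pass approach, without any explicit loops.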
https://stackoverflow.com/questions/39515291