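I have just started learning pandas and have probably gotten lost in it, so I need some help.
I am trying to automate the process of editing CSV files. I receive them unsorted and am trying to get them ready for some bulk updates.
So, what I need to do is:
1. Rename the required columns (I managed that fine).
2. Replace existing double quotes in the title column with two single quotes.
3. Add double quotes around all titles (i.e. the first column).
4. Drop columns (not sure which is best, del or df.drop).
Here is the code I have tried so far: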
import pandas as pd
df = pd.read_csv('Sheet1.csv')
droped = df.drop(df.columns[[2,3,4,5,6,7,8,9,10]],axis = 1,inplace=True)
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
swapped = renamed[[0]].str.replace(r'\"', r"\''")
updated = swapped.update('"' + df.columns[[0]].astype(str) + '"')
print(renamed.head())
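The quotes are giving me a headache right now: some titles contain double quotes, and I need to remove them and then add double quotes around every title.
The first three lines work fine, so I can drop the unneeded columns and rename the rest, but the last two operations are badly written. I tried to find a solution by googling, but no luck so far.
Example of this quote swap:
Now: Banini Movie" Teather
Wanted: "Banini Movie'' Teather"
Error: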
Traceback (most recent call last):
File "Desktop/testing_scripts/bulk_script.py", line 6, in <module>
swapped = renamed[[0]].str.replace(r'\"', r"\''")
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
Posted on 2022-07-01 18:43:07
To answer your question: you simply have one mistake.
# Here, you rename your columns:
renamed = df.rename(columns={df.columns[0]: 'title', df.columns[1]: 'product_id'})
# Here, you try to access the column by the label 0, which is not a column name...
renamed[[0]]
> None of [Int64Index([0], dtype='int64')] are in the [columns]
# This error is correct, only 'title' and 'product_id' are in 'columns'.
# What it could be:
renamed['title']
# Or to access by col #:
renamed.iloc[:, 0]
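The difference between label-based and position-based selection can be checked with a small sketch. The frame below and its contents are made up for illustration; only the column names `title` and `product_id` come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data; the original column names are arbitrary.
df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
renamed = df.rename(columns={df.columns[0]: "title", df.columns[1]: "product_id"})

# renamed[[0]] looks for a column *labelled* 0, which does not exist.
try:
    renamed[[0]]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Both of these select the first column.
by_label = renamed["title"]
by_position = renamed.iloc[:, 0]
```

Square brackets on a DataFrame always work with labels, while `.iloc` always works with integer positions, so `.iloc` is the safer choice when you only know where a column sits.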
Reading in only specific columns:
from io import StringIO
# StringIO just allows me to treat a string like a file,
# you'll use your file name in read_csv instead.
import pandas as pd
file = '0,1,2,3,4\nsr,ddpi,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,np.log(pop15),sr,ddpi,np.log(dpi)\nnp.log(dpi),pop75,np.log(pop15),sr,ddpi\nddpi,np.log(dpi),pop75,np.log(pop15),sr\n'
print(pd.read_csv(StringIO(file)))
print(pd.read_csv(StringIO(file), usecols=[0,1]))
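In the output, we can see that by using the keyword usecols we can filter for the columns we want while reading the file, instead of having to drop them afterwards: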
0 1 2 3 4
0 sr ddpi np.log(dpi) pop75 np.log(pop15)
1 np.log(pop15) sr ddpi np.log(dpi) pop75
2 pop75 np.log(pop15) sr ddpi np.log(dpi)
3 np.log(dpi) pop75 np.log(pop15) sr ddpi
4 ddpi np.log(dpi) pop75 np.log(pop15) sr
0 1
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
Also, here is another way to "drop" columns when all you really want is to keep a few of them:
df = pd.read_csv(StringIO(file))
df = df[['0', '1']]
print(df)
0 1
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
Renaming columns when you know their order:
# Personally, I find these methods to be simpler for the case you've presented~
df = pd.read_csv(StringIO(file), usecols=[0,1])
df.columns = ['title', 'product_id']
# OR
# If your file doesn't have a header, then remove 'header=0'
df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])
print(df)
title product_id
0 sr ddpi
1 np.log(pop15) sr
2 pop75 np.log(pop15)
3 np.log(dpi) pop75
4 ddpi np.log(dpi)
file = '0,1,2,3,4\nsr,Banini Movie" Teather,np.log(dpi),pop75,np.log(pop15)\nnp.log(pop15),sr,ddpi,np.log(dpi),pop75\npop75,Banini Movie" Teather,sr,ddpi,np.log(dpi)\nBanini Movie" Teather,pop75,Banini Movie" Teather,sr,ddpi\nddpi,np.log(dpi),pop75,Banini Movie" Teather,sr\n'
df = pd.read_csv(StringIO(file), usecols=[0,1], header=0, names=['title', 'product_id'])
print(df)
title product_id
0 sr Banini Movie" Teather
1 np.log(pop15) sr
2 pop75 Banini Movie" Teather
3 Banini Movie" Teather pop75
4 ddpi np.log(dpi)
# To do it to one column:
df['title'] = df['title'].str.replace('"', "''")
df['title'] = '"' + df['title'] + '"'
print(df)
title product_id
0 "sr" Banini Movie" Teather
1 "np.log(pop15)" sr
2 "pop75" Banini Movie" Teather
3 "Banini Movie'' Teather" pop75
4 "ddpi" np.log(dpi)
# To do it to all columns:
for col in df:
    df[col] = df[col].str.replace('"', "''")
    df[col] = '"' + df[col] + '"'
print(df)
title product_id
0 "sr" "Banini Movie'' Teather"
1 "np.log(pop15)" "sr"
2 "pop75" "Banini Movie'' Teather"
3 "Banini Movie'' Teather" "pop75"
4 "ddpi" "np.log(dpi)"
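One caveat with the all-columns loop: the `.str` accessor only works on string columns, so it would raise an error if a column such as `product_id` were read in as numbers. The sketch below shows one possible guard, assuming a hypothetical frame with numeric IDs; the helper name `quote_text` is made up for illustration and is not part of pandas.

```python
import pandas as pd

# Hypothetical data: string titles and numeric product IDs.
df = pd.DataFrame({
    "title": ['Banini Movie" Teather', "sr"],
    "product_id": [101, 102],
})

def quote_text(series):
    """Replace each double quote with two single quotes, then wrap in double quotes."""
    cleaned = series.astype(str).str.replace('"', "''", regex=False)
    return '"' + cleaned + '"'

for col in df.columns:
    # Only touch columns that actually hold strings; leave numeric ones alone.
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = quote_text(df[col])

print(df)
```

Passing `regex=False` makes it explicit that the double quote is treated as a literal character rather than a pattern.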
https://stackoverflow.com/questions/72820886