我有一个看起来像这样的DataFrame。
Variable
0 Religion - Buddhism
1 Source: Clickerz
2 Religion - Islam
3 Source: SRZ FREE
4 Ethnicity - Mixed - White & Black African
我想要操作variable
列来创建一个new column
,如下所示...
Variable New Column
0 Religion - Buddhism Buddhism
1 Source: Clickerz Clickerz
2 Religion - Islam Islam
3 Source: SRZ FREE SRZ FREE
4 Ethnicity - Mixed - White & Black African Mixed - White and Black African
这样我最终就可以有一个看起来像这样的DataFrame了.
Variable New Column
0 Religion Buddhism
1 Source Clickerz
2 Religion Islam
3 Source SRZ FREE
4 Ethnicity Mixed - White and Black African
我想遍历Variable
列并操作数据来创建New Column
。我计划使用多个if
语句来查找特定的单词,例如'Ethnicity'
或'Religion'
,然后应用操作。
例如..。
For row in df['Variable']:
if 'Religion' in row:
df['New Column'] = ...
elif 'Ethnicity' in row:
df['New Column'] = ...
elif: 'Source' in row:
df['New Column'] = ...
else:
df['New Column'] = 'Not Applicable'
尽管type(row)
返回'str'
,这意味着它属于string类,但这段代码一直将新列作为所有“不适用”的列返回,这意味着它没有检测到数据框中任何行中的任何字符串,即使我可以看到它们在那里。
我相信有一个简单的方法来做this...PLEASE帮助!
我也尝试过以下几种方法。
For row in df['Variable']:
if row.find('Religion') != -1:
df['New Column'] = ...
elif row.find('Ethnicity') != -1:
df['New Column'] = ...
elif: row.find('Source') != -1:
df['New Column'] = ...
else:
df['New Column'] = 'Not Applicable'
并且我继续得到新列的所有条目都是“不适用”。它再一次找不到现有列中的字符串。
是数据类型的问题还是什么?
发布于 2021-10-02 10:58:52
我创建了一个'string_splitter‘函数,并将其应用到一个λ函数中,这样就解决了这个问题。
我创建了以下函数,根据单元格中包含的不同子字符串以不同的方式拆分字符串。
def string_splitter(cell):
word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']
if any([word in cell for word in word_list1]):
result = cell.split("-")[1]
result = result.strip()
elif 'Source' in cell:
result = cell.split(":")[1]
result = result.strip()
elif 'Ethnicity' in cell:
result_list = cell.split("-")[1:3]
result = "-".join(result_list)
result = result.strip()
elif any([word in cell for word in word_list2]):
result = cell.split(" ")[1]
result = result.strip()
elif 'Number of Applicants' in cell:
result = cell
return result
然后,我在使用lambda操作时调用了string_splitter
。当代码迭代数据帧中指定列的每一行时,这将函数单独应用于每个单元格。如下图所示:
df['Answer'] = df['Visual Type'].apply(lambda x: string_splitter(x))
string_splitter
允许我创建New column
。
然后,我创建了另一个函数column_formatter
,用于在创建New Column
后操作Variable
列。第二个函数如下所示:
def column_formatter(cell):
word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']
if any([word in cell for word in word_list1]):
result = cell.split("-")[0]
result = result.strip()
elif 'Source' in cell:
result = cell.split(":")[0]
result = result.strip()
elif 'Disability' in cell:
result = cell.split(" ")[0]
result = result.strip()
elif any([word in cell for word in word_list2]):
result = 'Number of Applicants'
else:
result = 'Something wrong here'
return result
然后以相同的方式调用该函数,如下所示:
df['Visual Type'] = df['Visual Type'].apply(lambda x: column_formatter(x))
发布于 2021-09-29 19:10:33
您可以使用嵌套的for
循环:
# For each row in the dataframe
for row in df['column_variable']:
# Set boolean to indicate if a substring was found
substr_found = False
# For each substring
for sub_str in ["substring1", "substring2"]:
# If the substring is in the row
if sub_str in row:
# Execute code...
df['new_column'] = ...
# Substring was found!
substr_found = True
# If substring was not found
if not substr_found:
# Set invalid code...
df['new column'] = 'Not Applicable'
发布于 2021-09-29 21:27:30
在操作DataFrame
时,应尽可能避免遍历各行。这篇article解释了什么是更有效的替代方案。
您基本上是在尝试基于某个固定的映射来转换字符串。很自然地,dict
就会出现在脑海中:
substring_map = {
"at": "pseudo-cat",
"dog": "true dog",
"bre": "something else",
"na": "not applicable"
}
在处理大量子字符串的情况下,此映射可以从文件中读取,例如JSON文件。
子字符串匹配逻辑现在可以从映射定义中解耦:
def translate_substring(x):
for substring, new_string in substring_map.items():
if substring in x:
return new_string
return "not applicable"
使用带有'mapping‘函数的apply
来生成您的目标列:
df = pd.DataFrame({"name":
["cat", "dogg", "breeze", "bred", "hat", "misty"]})
df["new_column"] = df["name"].apply(translate_substring)
# df:
# name new_column
# 0 cat pseudo-cat
# 1 dogg true dog
# 2 breeze something else
# 3 bred something else
# 4 hat pseudo-cat
# 5 misty not applicable
这段代码应用于pd.concat([df] * 10000)
(60,000行),在Colab笔记本上运行42ms。相比之下,使用iterrows
只需3.67秒--速度提高了87倍。
https://stackoverflow.com/questions/69382068
复制相似问题