文章/答案/技术大牛

发布

社区首页 >问答首页 >Python -在使用FOR循环遍历pandas DataFrame时，使用if语句在字符串中查找子字符串

问Python -在使用FOR循环遍历pandas DataFrame时，使用if语句在字符串中查找子字符串
EN

Stack Overflow用户

提问于 2021-09-29 19:01:08

回答 6查看 95关注 0票数 2

我有一个看起来像这样的DataFrame。

                                     Variable
0                         Religion - Buddhism
1                            Source: Clickerz
2                            Religion - Islam
3                            Source: SRZ FREE
4   Ethnicity - Mixed - White & Black African

我想要操作variable列来创建一个new column，如下所示...

                                        Variable           New Column
    0                         Religion - Buddhism           Buddhism
    1                            Source: Clickerz           Clickerz
    2                            Religion - Islam            Islam
    3                            Source: SRZ FREE            SRZ FREE
    4   Ethnicity - Mixed - White & Black African         Mixed - White and Black African

这样我最终就可以有一个看起来像这样的DataFrame了.

                            Variable                      New Column
    0                       Religion                        Buddhism
    1                         Source                        Clickerz
    2                       Religion                           Islam
    3                         Source                        SRZ FREE
    4                      Ethnicity         Mixed - White and Black African

我想遍历Variable列并操作数据来创建New Column。我计划使用多个if语句来查找特定的单词，例如'Ethnicity'或'Religion'，然后应用操作。

例如..。

For row in df['Variable']:

      if 'Religion' in row:

              df['New Column'] = ...
      
      elif 'Ethnicity' in row:

              df['New Column'] = ...

      elif: 'Source' in row:

              df['New Column'] = ...

      else:

              df['New Column'] = 'Not Applicable'

尽管type(row)返回'str'，这意味着它属于string类，但这段代码一直将新列作为所有“不适用”的列返回，这意味着它没有检测到数据框中任何行中的任何字符串，即使我可以看到它们在那里。

我相信有一个简单的方法来做this...PLEASE帮助！

我也尝试过以下几种方法。

For row in df['Variable']:

  if row.find('Religion') != -1:

          df['New Column'] = ...

  elif row.find('Ethnicity') != -1:

          df['New Column'] = ...

  elif: row.find('Source') != -1:

          df['New Column'] = ...

  else:

          df['New Column'] = 'Not Applicable'

并且我继续得到新列的所有条目都是“不适用”。它再一次找不到现有列中的字符串。

是数据类型的问题还是什么？

python

pandas

dataframe

for-loop

if-statement

回答 6

Stack Overflow用户

回答已采纳

发布于 2021-10-02 10:58:52

我创建了一个'string_splitter‘函数，并将其应用到一个λ函数中，这样就解决了这个问题。

我创建了以下函数，根据单元格中包含的不同子字符串以不同的方式拆分字符串。

def string_splitter(cell):

word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']

if any([word in cell for word in word_list1]):
    
    result = cell.split("-")[1]
    result = result.strip()
    
elif 'Source' in cell:
    
    result = cell.split(":")[1]
    result = result.strip()
    
elif 'Ethnicity' in cell:
    
    result_list = cell.split("-")[1:3]
    result = "-".join(result_list)
    result = result.strip()

elif any([word in cell for word in word_list2]):
    
    result = cell.split(" ")[1]
    result = result.strip()

elif 'Number of Applicants' in cell:
    
    result = cell


return result

然后，我在使用lambda操作时调用了string_splitter。当代码迭代数据帧中指定列的每一行时，这将函数单独应用于每个单元格。如下图所示：

df['Answer'] = df['Visual Type'].apply(lambda x: string_splitter(x))

string_splitter允许我创建New column。

然后，我创建了另一个函数column_formatter，用于在创建New Column后操作Variable列。第二个函数如下所示：

def column_formatter(cell):

word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']

if any([word in cell for word in word_list1]):
    
    result = cell.split("-")[0]
    result = result.strip()

elif 'Source' in cell:
    
    result = cell.split(":")[0]
    result = result.strip()

elif 'Disability' in cell:
    
    result = cell.split(" ")[0]
    result = result.strip()

elif any([word in cell for word in word_list2]):
    
    result = 'Number of Applicants'
    
else:
    
    result = 'Something wrong here'


return result

然后以相同的方式调用该函数，如下所示：

df['Visual Type'] = df['Visual Type'].apply(lambda x: column_formatter(x))

票数 0

Stack Overflow用户

发布于 2021-09-29 19:10:33

您可以使用嵌套的for循环：

# For each row in the dataframe
for row in df['column_variable']:
    # Set boolean to indicate if a substring was found
    substr_found = False

    # For each substring
    for sub_str in ["substring1", "substring2"]:
        # If the substring is in the row
        if sub_str in row:
            # Execute code...
            df['new_column'] = ...

            # Substring was found!
            substr_found = True

    # If substring was not found
    if not substr_found:
        # Set invalid code...
        df['new column'] = 'Not Applicable'

票数 1

Stack Overflow用户

发布于 2021-09-29 21:27:30

在操作DataFrame时，应尽可能避免遍历各行。这篇article解释了什么是更有效的替代方案。

您基本上是在尝试基于某个固定的映射来转换字符串。很自然地，dict就会出现在脑海中：

substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",    
    "na": "not applicable"
}

在处理大量子字符串的情况下，此映射可以从文件中读取，例如JSON文件。

子字符串匹配逻辑现在可以从映射定义中解耦：

def translate_substring(x):
  for substring, new_string in substring_map.items():
    if substring in x:
      return new_string
  return "not applicable"

使用带有'mapping‘函数的apply来生成您的目标列：

df = pd.DataFrame({"name":
  ["cat", "dogg", "breeze", "bred", "hat", "misty"]})

df["new_column"] = df["name"].apply(translate_substring)

# df:
#      name      new_column
# 0     cat      pseudo-cat
# 1    dogg        true dog
# 2  breeze  something else
# 3    bred  something else
# 4     hat      pseudo-cat
# 5   misty  not applicable

这段代码应用于pd.concat([df] * 10000) (60,000行)，在Colab笔记本上运行42ms。相比之下，使用iterrows只需3.67秒--速度提高了87倍。

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/69382068

复制

相似问题

问Python -在使用FOR循环遍历pandas DataFrame时，使用if语句在字符串中查找子字符串
EN

回答 6

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python -在使用FOR循环遍历pandas DataFrame时，使用if语句在字符串中查找子字符串EN

回答 6

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python -在使用FOR循环遍历pandas DataFrame时，使用if语句在字符串中查找子字符串
EN