Q: pandas, Python, Excel: searching for a substring in a column of df1 to write a string to a column in df2

Stack Overflow user

Asked 2018-07-06 01:42:31

Answers: 2, Views: 5.6K, Followers: 0, Votes: 1

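I am using the pandas package in Python to process, read, and write Excel spreadsheets. I have created two DataFrames (df1 and df2) whose cells are all of string data type. df1 has over 50,000 rows. Every column of df1 has many cells that were NaN, which I have converted to the string "Empty". df2 has over 9,000 rows. Every row in "WHSE_Nbr" and "WHSE_Desc_HR" contains an actual string value. In the last two columns of df2, only some rows hold a value other than the string "Empty". The "Warehouse" column of df1 has many cells that contain names made up of words only. The rows of df1's "Warehouse" column that I want to identify are those containing a warehouse number found in the "WHSE_Nbr" column of df2.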

Example of dataframe1 - df1
Job         Warehouse          GeneralDescription      Purpose
Empty       AP                 Accounts Payable        Accounting
Empty       Empty              Empty                   Empty
Empty       Cyber Security GA  Security & Compliance   Data Security
Empty       Merch|04-1854      Empty                   Empty
Empty       WH -1925           Empty                   Empty
Empty       Montreal-10        Empty                   Empty
Empty       canada| 05-4325    Empty                   Empty

Example of dataframe2 - df2

WHSE_Nbr    WHSE_Desc_HR         WHSE_Desc_AD    WHSE_Abrv
1           Technology                           Tech
2           Finance                 
...         ...                 
10          Recruiting           Campus Outreach
1854        Community Relations
...         ...
1925        HumanResources
4325        Global People
9237        International Tech


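So I would like to iterate over all rows of df1's "Warehouse" column and search for WHSE numbers that appear in the WHSE_Nbr column of df2. In this example, I want my code to find 1854 in df1's "Warehouse" column, map that number to the associated cell in df2's WHSE_Desc_HR column, and write "Community Relations" into df1's "GeneralDescription" column, on the same row whose Warehouse cell contains the substring "1854". It should likewise write "Human Resources" on the row whose Warehouse cell contains the substring "1925". When the iteration reaches "Montreal-10", I want my code to write "Campus Outreach" into df1's GeneralDescription column, because a value in df2's WHSE_Desc_AD column, when present, overrides the value in df2's "WHSE_Desc_HR" column.

I am familiar enough with pandas to read Excel files (.xlsx), create DataFrames, change the data types in a DataFrame so I can iterate, and view the DataFrames, but I cannot work out the most efficient and effective way to structure the code for this.

I just had to edit this question because I realized I left out something very important. Whenever a number appears in the Warehouse column, the number I want to match always follows a hyphen or dash (-). So the df1 Warehouse row that says "canada| 05-4325" should recognize 4325, match it against df2, and write "Global People" into the GeneralDescription column of df1. Sorry, everyone. Thank you very much for your help; the two answers below are a great start. Thanks.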

import numpy as np
import pandas as pd

excel_file = '/Users/cbri/anaconda3/WHSE_gen.xlsx'
df1 = pd.read_excel(excel_file, usecols=[1, 5, 6, 7])
excel_file = '/Users/cbri/PycharmProjects/True_Dept/HR_excel.xlsx'
df2 = pd.read_excel(excel_file)
# Replace NaN with the placeholder string "Empty", then cast every cell to str.
df1 = df1.replace(np.nan, "Empty")
df2 = df2.replace(np.nan, "Empty")
df1 = pd.DataFrame(df1, dtype='str')
df2 = pd.DataFrame(df2, dtype='str')

# Yeah, I need a push in the right direction; guess I should use iteritems()?
# Unfinished attempt, left commented out because it is not valid Python yet:
# for column in df1:
#     if (df1['Warehouse'])
# So I got as far as returning all records that contain the substring "1854",
# but obviously that's without the for and if statement above.
df1[df1['Warehouse'].str.contains("1854", na=False)]

python

excel

pandas

dataframe

string-search

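2 Answers

Stack Overflow user

Accepted answer

Answered 2018-07-06 02:36:21

What I would do is write a regex to extract the number from your column, join the tables, and then maybe do the rest in Excel. (The column will be updated.)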

import re
import pandas as pd

df1 = pd.DataFrame({'Department' : ['Merch - 1854', '1925 - WH','Montreal 10'],'TrueDepartment' : ['Empty','empty','empty']})
df2 = pd.DataFrame({'Dept_Nbr' : [1854, 1925, 10], 'Dept_Desc_HR' : ['Community Relations','Human Resources','Recruiting']})

You can then try out what the function will do:

line = 'Merch - 1854 '
match = re.search(r'[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

If you need the match to come after a specific character, as requested in the comments, use this instead:

line = '12125 15151 Merch -1854 '
match = re.search(r'(?<=-)[0-9]+', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

Note that if there is a space or any other character after the "-", you need to add it to the regex for this to work!
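One way to cope with optional whitespace after the hyphen is to consume it in the pattern and capture only the digits. The following is a minimal sketch, not part of the original answer; `number_after_hyphen` is a hypothetical helper name, and the sample strings are taken from the question's Warehouse column:

```python
import re

def number_after_hyphen(text):
    """Return the first run of digits that follows a hyphen, or 0 if none.

    Whitespace between the hyphen and the digits is allowed.
    """
    match = re.search(r'-\s*([0-9]+)', text)
    return int(match.group(1)) if match else 0

print(number_after_hyphen('Merch|04-1854'))      # 1854
print(number_after_hyphen('Merch - 1854'))       # 1854
print(number_after_hyphen('canada| 05-4325'))    # 4325
print(number_after_hyphen('Cyber Security GA'))  # 0
```

A capture group is used here instead of a lookbehind because Python's `re` module only accepts fixed-width lookbehinds, so a variable amount of whitespace cannot go inside `(?<=...)`.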

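Important: this assumes there is only one number in the text. re.search returns only the first match, and if nothing matches, the code returns 0. You can change that as you wish; the point is that it will at least not fail.

Write the function: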

def extract_number(field):
    # Return the digits that directly follow a hyphen, or 0 when there is no match.
    match = re.search(r'(?<=-)[0-9]+', field)
    if match is None:
        return 0
    else:
        return int(match[0])

Apply it to the DataFrame:

df1['num_col'] = df1['Department'].apply(extract_number)
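As an alternative to calling a Python function per row, the same extraction can probably be done with the vectorized `Series.str.extract`. This is a sketch under assumptions: the sample rows are made up for illustration, and the pattern also tolerates whitespace after the hyphen, which the function above does not.

```python
import pandas as pd

# Made-up sample rows; 'AP' has no hyphenated number.
df1 = pd.DataFrame({'Department': ['Merch -1854', 'WH - 1925', 'AP']})

# expand=False returns a Series; rows without a match become NaN.
digits = df1['Department'].str.extract(r'-\s*([0-9]+)', expand=False)

# Fall back to 0 for non-matching rows, mirroring extract_number.
df1['num_col'] = digits.fillna('0').astype(int)
print(df1)
```

On a frame with tens of thousands of rows this avoids the per-row Python call, although whether it is measurably faster depends on the data.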

Finally, do the join:

df1.merge(df2, left_on = ['num_col'], right_on = ['Dept_Nbr'])

From here you can work out which columns you need, whether in Python or in Excel.
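To stay in Python for the last step, one possible sketch writes the description back into df1 using the column names from the question. It assumes that all cells are strings, that missing values are the literal string "Empty", and that WHSE_Nbr values are unique; `lookup`, `nbr`, and `mapped` are illustrative names, and the sample rows are abbreviated from the question's tables.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Warehouse': ['AP', 'Merch|04-1854', 'WH -1925', 'Montreal-10', 'canada| 05-4325'],
    'GeneralDescription': ['Accounts Payable', 'Empty', 'Empty', 'Empty', 'Empty'],
})
df2 = pd.DataFrame({
    'WHSE_Nbr': ['10', '1854', '1925', '4325'],
    'WHSE_Desc_HR': ['Recruiting', 'Community Relations', 'HumanResources', 'Global People'],
    'WHSE_Desc_AD': ['Campus Outreach', 'Empty', 'Empty', 'Empty'],
})

# Prefer WHSE_Desc_AD; fall back to WHSE_Desc_HR where AD is "Empty".
desc = df2['WHSE_Desc_AD'].where(df2['WHSE_Desc_AD'] != 'Empty', df2['WHSE_Desc_HR'])
lookup = pd.Series(desc.to_numpy(), index=df2['WHSE_Nbr'])  # assumes unique numbers

# Digits after a hyphen, kept as strings because both frames hold strings.
nbr = df1['Warehouse'].str.extract(r'-\s*([0-9]+)', expand=False)
mapped = nbr.map(lookup)  # NaN where there is no number or no match in df2

# Overwrite only the rows whose number was found in df2.
df1['GeneralDescription'] = mapped.fillna(df1['GeneralDescription'])
print(df1)
```

Using `map` with a lookup Series rather than `merge` keeps df1's row order and row count unchanged, which matters when the result is written back to the same spreadsheet.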

Votes: 1

Stack Overflow user

Answered 2018-07-06 02:24:16

Try this:

# Index df2 by the integer department number so that .at lookups work.
df2['Dept_Nbr'] = [int(i) for i in df2['Dept_Nbr']]
df2 = df2.set_index('Dept_Nbr')
for n in df2.index:
    for i in df1.index:
        # Plain substring test, so the number is compared as a string.
        if str(n) in df1.at[i, 'Department']:
            ad = df2.at[n, 'Dept_Desc_AD']
            if pd.notna(ad) and ad != 'Empty':  # if an AD value exists, it wins
                df1.at[i, 'TrueDepartment'] = ad
            else:
                df1.at[i, 'TrueDepartment'] = df2.at[n, 'Dept_Desc_HR']
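One caveat with the loop above is that a substring test also matches short numbers that happen to occur inside longer ones. The following small illustration uses made-up values; `substring_hits` and `exact_hits` are illustrative names:

```python
import re

numbers = ['1', '10', '1854']
cell = 'Merch|04-1854'

# Substring test: every number whose digits appear anywhere in the cell matches.
substring_hits = [n for n in numbers if n in cell]

# Exact test: extract the digits after the hyphen, then compare for equality.
match = re.search(r'-\s*([0-9]+)', cell)
exact_hits = [n for n in numbers if match and n == match.group(1)]

print(substring_hits)  # ['1', '1854']
print(exact_hits)      # ['1854']
```

With the substring approach the last matching number in the loop wins, so the result depends on the order of df2; extracting the number first avoids that.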

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51197176
