
How to merge DataFrames of different shapes in pandas?

Stack Overflow user
Asked 2020-11-24 05:21:03
Answers: 2 · Views: 46 · Followers: 0 · Votes: 1

I'm trying to merge two pandas DataFrames that contain a lot of data, but I'm running into problems. I'll try to illustrate with a smaller example.

df1 has a list of equipment and several equipment-related columns:

Item ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement Mixer  Tim   Active   New York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong Kong
5       Winch         Kojo  Inactive Sydney
6       Circular Saw  Alex  Active   Moscow

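df2 has a list of instances where equipment was used. It shares some columns with df1, but some of its fields are NaN, and it also records instances of equipment that is not in df1: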

Item ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement Mixer  NaN   29/02/2020 New York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt

The DataFrame I would like to end up with is:

Item ID Equipment     Owner Status   Date       Location
1       Jackhammer    James Active   08/09/2020 London
1       Jackhammer    James Active   08/10/2020 London
2       Cement Mixer  Tim   Active   29/02/2020 New York
3       Drill         Sarah Active   11/02/2020 Paris
3       Drill         Sarah Active   30/11/2020 Paris
3       Drill         Sarah Active   21/12/2020 Paris
4       Ladder        Luke  Inactive NaN        Hong Kong
5       Winch         Kojo  Inactive NaN        Sydney
6       Circular Saw  Alex  Active   19/06/2020 Moscow
7       Hammer        Ken   NaN      21/12/2020 Toronto
8       Sander        Ezra  NaN      19/06/2020 Frankfurt

Instead, with the code below I get duplicated columns (the _x and _y pairs), which I think is because of the NaN values:

data = pd.merge(df1, df2, how='outer', on=['Item ID'])

Item ID Equipment_x  Equipment_y Owner_x Owner_y Status   Date       Location_x  Location_y
1       Jackhammer   NaN         James   James   Active   08/09/2020 London      London
1       Jackhammer   NaN         James   James   Active   08/10/2020 London      London
2       Cement Mixer NaN         Tim     NaN     Active   29/02/2020 New York    New York
3       Drill        NaN         Sarah   Sarah   Active   11/02/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   30/11/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   21/12/2020 Paris       NaN
4       Ladder       NaN         Luke    NaN     Inactive NaN        Hong Kong   Hong Kong
5       Winch        NaN         Kojo    NaN     Inactive NaN        Sydney      Sydney
6       Circular Saw NaN         Alex    NaN     Active   19/06/2020 Moscow      Moscow
7       NaN          Hammer      NaN     Ken     NaN      21/12/2020 NaN         Toronto
8       NaN          Sander      NaN     Ezra    NaN      19/06/2020 NaN         Frankfurt

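Ideally I could just drop the _y columns, but the data in the bottom rows means I would lose important information. The only alternative I can think of is to merge each pair of columns and force pandas to compare the values, always preferring the non-NaN one. I'm not sure whether this is possible?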
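As a side note on where the _x and _y columns come from, here is a minimal sketch on made-up two-row frames (the names `left` and `right` are illustrative, not from the question). It suggests the suffixes are added for any column that exists in both frames but is not listed in `on`, regardless of whether NaN values are present.

```python
import pandas as pd

# Tiny illustrative frames; "Owner" appears in both but is not a join key.
left = pd.DataFrame({"Item ID": [1, 2], "Owner": ["James", "Tim"]})
right = pd.DataFrame({"Item ID": [2, 3], "Owner": [None, "Ken"]})

# pandas keeps both copies of the overlapping column and disambiguates
# them with its default suffixes, "_x" (left) and "_y" (right).
merged = pd.merge(left, right, how="outer", on="Item ID")
print(list(merged.columns))
```

Adding a shared column to `on` removes its suffixed pair, but only rows where the values match exactly will then line up.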


2 Answers

Stack Overflow user

Answered 2020-11-24 06:49:16

"merge each pair of columns and force pandas to compare the values, always preferring the non-NaN one"

Is this what you mean?

In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])                         

In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])                            

In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])                                     

In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)                

In [49]: data                                                                                        
Out[49]: 
    Item ID     Equipment    Status        Date   Location  Owner
0         1    Jackhammer    Active  08/09/2020     London  James
1         1    Jackhammer    Active  08/10/2020     London  James
2         2  Cement Mixer    Active  29/02/2020   New York    Tim
3         3         Drill    Active  11/02/2020      Paris  Sarah
4         3         Drill    Active  30/11/2020      Paris  Sarah
5         3         Drill    Active  21/12/2020      Paris  Sarah
6         4        Ladder  Inactive         NaN  Hong Kong   Luke
7         5         Winch  Inactive         NaN     Sydney   Kojo
8         6  Circular Saw    Active  19/06/2020     Moscow   Alex
9         7        Hammer       NaN  21/12/2020    Toronto    Ken
10        8        Sander       NaN  19/06/2020  Frankfurt   Ezra

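(As far as I know) you can't really merge on null columns. However, you can use fillna to replace a value with one from another column whenever it is NaN. It's not a very elegant solution, but it at least seems to solve your example.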
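For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. It uses a reduced subset of the question's rows, and it swaps in `Series.combine_first` in place of `fillna`; for this purpose the two should behave the same (keep the caller's value, fall back to the other Series only where the caller is null). The loop over column names is an assumption about which columns overlap.

```python
import numpy as np
import pandas as pd

# A reduced version of the question's data (a subset of the rows).
df1 = pd.DataFrame({
    "Item ID": [2, 3, 4],
    "Equipment": ["Cement Mixer", "Drill", "Ladder"],
    "Owner": ["Tim", "Sarah", "Luke"],
    "Status": ["Active", "Active", "Inactive"],
    "Location": ["New York", "Paris", "Hong Kong"],
})
df2 = pd.DataFrame({
    "Item ID": [2, 3, 7],
    "Equipment": ["Cement Mixer", "Drill", "Hammer"],
    "Owner": [np.nan, "Sarah", "Ken"],
    "Date": ["29/02/2020", "11/02/2020", "21/12/2020"],
    "Location": ["New York", np.nan, "Toronto"],
})

data = pd.merge(df1, df2, how="outer", on=["Item ID", "Equipment"])

# Prefer the df2 value (_y) and fall back to the df1 value (_x)
# wherever the df2 value is null.
for col in ["Owner", "Location"]:
    data[col] = data[f"{col}_y"].combine_first(data[f"{col}_x"])
    data = data.drop(columns=[f"{col}_x", f"{col}_y"])

print(data)
```

Swapping `_x` and `_y` in the loop would prefer the df1 value instead, which only matters when both frames hold different non-null values for the same item.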

See also: pandas combine two columns with null values

Votes: 0

Stack Overflow user

Answered 2020-11-24 07:03:08

In general, you can do it as follows:

# Merge the two DataFrames, using a suffix that ideally does
# not appear in your data.
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# Now remove the duplicate columns by merging their content:
# use the value of column + suffix_string where column is empty.
columns_to_remove = []
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        # Where data[col] is NaN keep second_col's value, otherwise take data[col].
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data.drop(columns=columns_to_remove, inplace=True)
data

The result is:

    Item_ID     Equipment  Owner    Status   Location        Date
0         1    Jackhammer  James    Active     London  08/09/2020
1         1    Jackhammer  James    Active     London  08/10/2020
2         2  Cement_Mixer    Tim    Active   New_York  29/02/2020
3         3         Drill  Sarah    Active      Paris  11/02/2020
4         3         Drill  Sarah    Active      Paris  30/11/2020
5         3         Drill  Sarah    Active      Paris  21/12/2020
6         4        Ladder   Luke  Inactive  Hong_Kong         NaN
7         5         Winch   Kojo  Inactive     Sydney         NaN
8         6  Circular_Saw   Alex    Active     Moscow  19/06/2020
9         7        Hammer    Ken       NaN    Toronto  21/12/2020
10        8        Sander   Ezra       NaN  Frankfurt  19/06/2020
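If this pattern is needed in more than one place, the same approach could be wrapped in a small helper. The sketch below is one possible packaging, not part of the original answer: the function name `merge_prefer_left` and the two tiny frames are hypothetical, and `fillna` is used in place of `where`, which should give the same result here. It keeps the left frame's value and falls back to the right frame's value only where the left one is null.

```python
import pandas as pd

def merge_prefer_left(left, right, on, suffix="_DF2"):
    """Outer-merge two frames; for shared non-key columns keep the left
    value and fall back to the right value where the left is null."""
    # The empty left suffix keeps left column names unchanged.
    merged = pd.merge(left, right, how="outer", on=on, suffixes=("", suffix))
    for col in left.columns:
        other = f"{col}{suffix}"
        if other in merged.columns:
            merged[col] = merged[col].fillna(merged[other])
            merged = merged.drop(columns=other)
    return merged

# Hypothetical miniature frames, only to exercise the helper.
left = pd.DataFrame({
    "Item_ID": [1, 2],
    "Owner": ["James", None],
    "Status": ["Active", "Active"],
})
right = pd.DataFrame({
    "Item_ID": [2, 3],
    "Owner": ["Tim", "Ken"],
    "Date": ["29/02/2020", "21/12/2020"],
})

result = merge_prefer_left(left, right, on=["Item_ID"])
print(result)
```

The suffix only needs to be a string that does not already end any column name in either frame; otherwise the lookup for the suffixed column could match the wrong column.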

Based on the following test data:

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement_Mixer  Tim   Active   New_York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong_Kong
5       Winch         Kojo  Inactive Sydney
6       Circular_Saw  Alex  Active   Moscow"""), sep=r'\s+')


df2= pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement_Mixer  NaN   29/02/2020 New_York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular_Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt"""), sep=r'\s+')
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/64976628
