
Q: Remove duplicate column values and choose which row to keep based on a condition in pandas

Stack Overflow user

Asked 2020-05-16 06:16:47

Answers 2 · Views 44 · Followers 0 · Votes 0

I have a dataframe such as:

COL1                         COL2           COL3     COL4       COL4bis     COL5  COL6 COL7  COL8     COL9  COL10 COL11  COL12           COL13
APE.1:8-9(+):Canis_lups      SEQ1            0.171    1041       243        0     436  1476  1485     194   487   1091   3.305000e-05    52
APE.1:8-9(+):Canis_lups      YP_SEQ1         0.171    1041       243        0     436  1476  1485     194   487   1091   3.305000e-05    52
APE.1:8-9(+):Canis_lups      SEQ2            0.20     1081       248        1     436  1476  1485     194   497   1091   0.305000e-08    51
APZ.1:1-1(-):Felis_catus     SEQ1            0.184     732       184        0      61   792  1071     233   458   1308   2.275000e-03    45
OKI:3946-7231(-):Ratus       SEQ3            0.185     852       203        0     388  1239  3285     194   443  1091   5.438000e-05    53
OKI:3946-7231(-):Ratus       XP_SEQ3         0.185     852       203        0     388  1239  3285     194   443  1091   5.438000e-05    53

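I want to remove rows that have exactly the same values in COL1 and COL3:COL13 (that is, every column except COL2). To decide which COL2 to keep, I keep the row whose COL2 starts with one of the prefixes in this list:

prefix_list =['AC_','NC_',"YP_"]

If none of the duplicated rows has a prefix from the list, I keep the first one. In this example, the expected result is: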

APE.1:8-9(+):Canis_lups      YP_SEQ1         0.171    1041       243        0     436  1476  1485     194   487   1091   3.305000e-05    52
APE.1:8-9(+):Canis_lups      SEQ2            0.20     1081       248        1     436  1476  1485     194   497   1091   0.305000e-08    51
APZ.1:1-1(-):Felis_catus     SEQ1            0.184     732       184        0      61   792  1071     233   458   1308   2.275000e-03    45
OKI:3946-7231(-):Ratus       XP_SEQ3         0.185     852       203        0     388  1239  3285     194   443  1091   5.438000e-05    53

pandas

dataframe

python

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-05-16 07:23:04

If I understand correctly, this should do the trick:

import pandas as pd

#NOTE: I've only created a dataframe with 6 columns, but the code still applies to your dataframe of 13 columns
#data
d = {'COL1': ['APE.1:8-9(+):Canis_lups', 'APE.1:8-9(+):Canis_lups', 'APE.1:8-9(+):Canis_lups', 'APZ.1:1-1(-):Felis_catus', 'OKI:3946-7231(-):Ratus', 'OKI:3946-7231(-):Ratus'],
 'COL2': ['SEQ1', 'YP_SEQ1', 'SEQ1', 'SEQ1', 'SEQ3', 'XP_SEQ3'],
 'COL3': [0.171, 0.171, 0.20, 0.184, 0.185, 0.185],
 'COL4': [243, 243, 248, 184, 203, 203],
 'COL5': [0, 0, 1, 0, 0, 0],
 'COL6': [436, 436, 436, 61, 388, 388]}

#create data frame
df = pd.DataFrame(data = d)

#list of prefixes
prefix_list =['AC_','NC_',"YP_"]
#list of columns to group (every column except COL2)
groupingColumns = [c for c in df if c != 'COL2']
#create check column: 1 if COL2 starts with one of the prefixes, otherwise 0
df['prefix_check'] = df['COL2'].apply(lambda x: 1 if x.startswith(tuple(prefix_list)) else 0)
#sort dataframe (asc=False); prefix_check is placed ahead of COL2 so flagged rows lead each group of duplicates
sortColumns = ['COL1', 'prefix_check'] + [c for c in df if c not in ('COL1', 'prefix_check')]
df = df.sort_values(by=sortColumns, ascending=False)
#drop duplicates based on other columns and keep first value (this will keep the one where the flag check is 1)
output = df.drop_duplicates(subset=groupingColumns, keep='first').reset_index(drop = True)
#remove check column
output = output.drop(['prefix_check'], axis=1)

print(output)

                       COL1     COL2   COL3  COL4  COL5  COL6 ..........
0    OKI:3946-7231(-):Ratus  XP_SEQ3  0.185   203     0   388 ..........
1  APZ.1:1-1(-):Felis_catus     SEQ1  0.184   184     0    61 ..........
2   APE.1:8-9(+):Canis_lups  YP_SEQ1  0.171   243     0   436 ..........
3   APE.1:8-9(+):Canis_lups     SEQ1  0.200   248     1   436 ..........
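
For comparison, here is a minimal sketch of the same flag-then-deduplicate idea that follows the question's rule literally: prefer a row whose COL2 starts with a listed prefix, otherwise keep the first row of the group. The helper name `dedupe_prefer_prefix` and the three-column sample are illustrative assumptions, not part of the original answer. Under this rule the OKI group keeps SEQ3, because XP_ is not in prefix_list; the question's expected output shows XP_SEQ3, which would require adding 'XP_' to the list.

```python
import pandas as pd

def dedupe_prefer_prefix(df, key_col, prefixes):
    """Drop rows duplicated on every column except key_col, preferring prefixed rows.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    other_cols = [c for c in df.columns if c != key_col]
    # True where key_col starts with any of the prefixes (plain Python str.startswith)
    has_prefix = df[key_col].map(lambda s: s.startswith(tuple(prefixes))).astype(bool)
    # Stable argsort on the negated flag: prefixed rows move to the front,
    # all other rows keep their original relative order
    order = (~has_prefix).to_numpy().argsort(kind="stable")
    kept = df.iloc[order].drop_duplicates(subset=other_cols, keep="first")
    # Restore the original row order
    return kept.sort_index()

# Sample data taken from the question (only three of the value columns)
df = pd.DataFrame({
    "COL1": ["APE.1:8-9(+):Canis_lups"] * 3
            + ["APZ.1:1-1(-):Felis_catus"]
            + ["OKI:3946-7231(-):Ratus"] * 2,
    "COL2": ["SEQ1", "YP_SEQ1", "SEQ2", "SEQ1", "SEQ3", "XP_SEQ3"],
    "COL3": [0.171, 0.171, 0.20, 0.184, 0.185, 0.185],
    "COL4": [1041, 1041, 1081, 732, 852, 852],
})

result = dedupe_prefer_prefix(df, "COL2", ["AC_", "NC_", "YP_"])
print(result)
```

Because the sort is stable and only the flag is used as the key, the choice within a group never depends on how the COL2 strings happen to compare alphabetically.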

Votes 1

Stack Overflow user

Posted 2020-05-16 07:10:02

#list of prefixes
prefix_list =['AC_','NC_',"YP_"]
#join the prefixes into a single regex alternation
s="|".join(prefix_list)
#create a dataframe with the rows whose COL2 starts with one of the prefixes
df2=df[df.COL2.str.match(s)]
#drop duplicates in the original dataframe, ignoring COL2
df.drop_duplicates(subset=df.columns.difference(['COL2']).tolist(), inplace=True)
#combine the two dataframes. Here I combine on two fields that should not have duplicates
df4 = (df2.set_index(['COL1','COL3']).combine_first(df.set_index(['COL1','COL3'])).reset_index())
#df4=pd.merge(df, df2, left_on='COL1', right_on='COL1', how='outer')
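
When the prefixes are matched with a regular expression, it matters whether the pattern is anchored at the start of the string. The short sketch below contrasts `Series.str.contains` (matches anywhere) with `Series.str.match` (matches from the first character). The sample IDs are made up for illustration, and `re.escape` is used only as a precaution in case a prefix ever contains regex metacharacters.

```python
import re
import pandas as pd

prefix_list = ['AC_', 'NC_', 'YP_']
# Escape each prefix, then join into one alternation: "AC_|NC_|YP_"
pattern = "|".join(re.escape(p) for p in prefix_list)

# Hypothetical IDs: the second one contains 'YP_' but does not start with it
ids = pd.Series(['YP_SEQ1', 'SEQ_YP_1', 'XP_SEQ3'])

anywhere = ids.str.contains(pattern)  # substring test
at_start = ids.str.match(pattern)     # prefix test (anchored at position 0)

print(pd.DataFrame({'id': ids, 'contains': anywhere, 'match': at_start}))
```

Only the anchored test treats 'SEQ_YP_1' as a non-prefixed ID, which is what a prefix list implies.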

Votes 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61829210
