我有两个数据帧
print(df1)
Name df1 RT [min] Molecular Weight RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2
0 unknow compound 1 7.590 194.04212 7.790 7.390 194.24212 193.84212
1 unknow compound 2 7.510 194.15000 7.710 7.310 194.35000 193.95000
2 unknow compound 3 7.410 194.04209 7.610 7.210 194.24209 193.84209
3 unknow compound 4 7.434 342.11615 7.634 7.234 342.31615 341.91615
4 unknow compound 5 0.756 176.03128 0.956 0.556 176.23128 175.83128 和
print(df2)
Name df2 Molecular Weight RT [min]
0 β-D-Glucopyranuronic acid 194.04220 7.483
1 α,α-Trehalose 194.10000 7.350
2 Threonylserine 206.08970 8.258
3 Terephthalic acid 166.02595 7.465
4 Sulfuric acid 97.96714 8.909如果满足两个条件,我希望将df2中的行合并到df1中的行。
df1
这意味着,如果df2中的一行符合来自df1的另外两行的条件,则df1行将被复制。
所以df3应该看上去
print(df3)
Name df1 RT [min]+0.2 RT [min]-0.2 Molecular Weight + 0.2 Molecular Weight - 0.2 Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.790 7.390 194.24212 193.84212 β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 1 7.790 7.390 194.24212 193.84212 α,α-Trehalose 194.1000 7.350
2 unknow compound 2 7.710 7.310 194.35000 193.95000 β-D-Glucopyranuronic acid 194.0422 7.483
3 unknow compound 3 8.310 7.910 206.30000 205.90000 Threonylserine 206.0897 8.258
4 unknow compound 4 7.634 7.234 342.31615 341.91615 NaN NaN NaN
5 unknow compound 5 0.956 0.556 176.23128 175.83128 NaN NaN NaN df2中的第一行满足了来自df1的未知化合物1、和未知化合物2的两个条件,所以在df3中有两次。
df2中的第二行只满足未知化合物1的两个条件。
df2中的第三行只满足未知化合物3的2个条件。
所有其他行都不符合df1的任何条件。
我试着用How to join two data frames for which column values are within a certain range? (第一个答案)来做
import pandas as pd
df_1 = pd.read_excel (r'D:\CD SandBox\df1.xlsx')
df_2 = pd.read_excel (r'D:\CD SandBox\df2.xlsx')
df2.index = pd.IntervalIndex.from_arrays(df2['RT [min]-0.2'],df2['RT [min]+0.2'],closed='both')
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])但是不知道如何处理第二行代码,并收到以下错误:
df2['RT [min]'] = df2['RT [min]'].apply( lambda x : df2.iloc[df1.index.get_loc(x)])
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\series.py", line 4213, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
File "<input>", line 1, in <lambda>
File "C:\Users\BCDD\Anaconda3\envs\PTSD\lib\site-packages\pandas\core\indexes\interval.py", line 730, in get_loc
raise KeyError(key)
KeyError: 8.258编辑:尝试使用merge_asof
根据How to join two DataFrames with multiple overlapping timestamps using an extra shared variable
df2 = df2.drop(['RT [min]', 'Molecular Weight'], axis=1)
df2['RT [min]']=df2['RT [min]-0.2']
pd.merge_asof(df2[['RT [min]','Name df2']] , df1,on='RT [min]',direction ='forward',allow_exact_matches =True)
...
RT [min] Name df2 Name df1 Molecular Weight
0 0.556 unknow compound 5 α,α-Trehalose 194.10000
1 7.210 unknow compound 3 α,α-Trehalose 194.10000
2 7.234 unknow compound 4 α,α-Trehalose 194.10000
3 7.310 unknow compound 2 α,α-Trehalose 194.10000
4 7.390 unknow compound 1 Terephthalic acid 166.02595为表提供错误的匹配。
任何想法\提示都将不胜感激。
发布于 2020-12-27 13:56:50
备选案文1
如果您使用的是熊猫1.2.0,您可以创建两个数据仓库的笛卡儿产品,然后检查条件。另外,由于您不需要RT [min]和Molecular Weight,所以我假设您已经删除了它们:
df3 = df1.merge(df2, how = 'cross', suffixes = [None,None])
#check if 'Molecular Weight' is in the interval:
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])
#check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])
df3 = df3[mask1 & mask2].reset_index(drop = True)输出:
df3
Name df1 RT [min]+0.2 RT [min]-0.2 ... Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.79 7.39 ... β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 2 7.71 7.31 ... β-D-Glucopyranuronic acid 194.0422 7.483
2 unknow compound 2 7.71 7.31 ... α,α-Trehalose 194.1000 7.350
3 unknow compound 3 7.61 7.21 ... β-D-Glucopyranuronic acid 194.0422 7.483
4 unknow compound 3 7.61 7.21 ... α,α-Trehalose 194.1000 7.350选项2
由于您的数据相当大,您可以使用生成器来避免加载整个结果数据。同样,我假设您将RT [min]和Molecular Weight从df1中删除。
import numpy as np
from itertools import product
def df_iter(df1,df2):
for row1, row2 in product(df1.values, df2.values):
# RT [min]-0.2 <= RT [min] <= RT [min]+0.2
if row1[2] <= row2[2] <= row1[1]:
#Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
if row1[4] <= row2[1] <= row1[3]:
yield np.concatenate((row1,row2))
df3_rows = df_iter(df1,df2)然后可以操作行:
for row in df3_rows:
print(row)输出:
['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]或者创建一个dataframe:
df3 = pd.DataFrame(data = list(df3_rows),
columns = np.concatenate((df1.columns, df2.columns)))这将产生与选项1相同的数据格式。
NOTE1:要小心使用函数df_iter中的条件项中的索引,这些条件项在my case中工作。
NOTE2:,我很确定您的数据与示例df3不匹配。
https://stackoverflow.com/questions/65464300
复制相似问题