Q: Check whether words in one DataFrame appear in another DataFrame (Python 3, pandas)

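Stack Overflow user

Asked 2015-09-09 03:17:08

Answers: 3 | Views: 1.3K | Followers: 0 | Votes: 3

Question: I have two DataFrames and want to remove any duplicates or partial duplicates between them.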

 DF1                 DF2

 **Phrases**         **Phrases**  
 Little Red          Little Red Corvette
 Grow Your           Grow Your Beans
 James Bond          James Dean
 Tom Brady

I want to remove the phrases "Little Red" and "Grow Your" from DF1 and then combine the two DataFrames, so that the final result looks like this:

 DF3
 Little Red Corvette
 Grow Your Beans
 James Bond
 James Dean
 Tom Brady

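Just to note: I only want to remove a phrase from DF1 if all of its words appear in a single phrase in DF2 (e.g. "Little Red" vs. "Little Red Corvette"). I do not want "James Bond" removed from DF1 just because "James Dean" appears in DF2.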

python-3.x

pandas

3 Answers

Stack Overflow user

Answered 2015-09-09 04:22:49

I came up with the solution below. It is not very elegant, but it works.

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# For each phrase in df1, if some phrase in df2 starts with it,
# overwrite the df1 phrase with the longer df2 phrase.
# Note that the (default) column name is 0.
for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.loc[j, 0].startswith(df1.loc[i, 0]):
            df1.loc[i, 0] = df2.loc[j, 0]

# Finally, merge df1 and df2 on the union of the keys.
# Here too the column name is 0.
df3 = df2.merge(df1, how='outer', on=0, sort=True)

The DataFrame df3 is exactly what you need.
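The prefix test in this answer is character-based, so "Little Red" would also match a phrase such as "Little Reddish". The question asks for a word-level rule (all words of the DF1 phrase appear in one DF2 phrase). The following is a minimal sketch of that rule, not part of the original answer. The column name `Phrases` comes from the question; the helper `covered_by_df2` is a hypothetical name, and the sketch assumes that word order and case do not need special handling.

```python
import pandas as pd

df1 = pd.DataFrame({"Phrases": ["Little Red", "Grow Your", "James Bond", "Tom Brady"]})
df2 = pd.DataFrame({"Phrases": ["Little Red Corvette", "Grow Your Beans", "James Dean"]})

# Split every DF2 phrase into a set of words once, up front.
df2_word_sets = [set(phrase.split()) for phrase in df2["Phrases"]]


def covered_by_df2(phrase):
    """Return True if every word of `phrase` occurs in a single DF2 phrase."""
    words = set(phrase.split())
    return any(words <= other for other in df2_word_sets)


# Keep only the DF1 phrases that are not covered, then stack them under DF2.
keep = ~df1["Phrases"].apply(covered_by_df2)
df3 = pd.concat([df2, df1[keep]], ignore_index=True)
print(df3)
```

Because the comparison uses sets, "James Bond" survives: "Bond" does not occur in "James Dean".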

Votes: 1

Stack Overflow user

Answered 2015-09-09 05:22:23

You can sort the values and then bisect them:

import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

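from bisect import bisect_left

def find_common(df1, df2):
    # Yield the df1 phrases that are not contained in their sorted df2 neighbour.
    vals = sorted(df2[0])  # sorted copy; df2 itself is left untouched
    for val in df1[0]:
        # Index of the first df2 phrase >= val, clamped to the last index.
        ind = bisect_left(vals, val, hi=len(vals) - 1)
        if val not in vals[ind]:
            yield val


# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
df3 = pd.concat([df2.sort_values(0), pd.DataFrame(list(find_common(df1, df2)))],
                ignore_index=True)
print(df3)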

Output:

                     0
0      Grow Your Beans
1           James Dean
2  Little Red Corvette
3           James Bond
4            Tom Brady

Sorting gives you an O(N log N) solution, instead of iterating over every string in df2 each time you check a string from df1.

Votes: 1
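The reason a single bisect lookup is enough can be sketched on plain lists: in a sorted list, every string that starts with a given phrase sorts at or immediately after the position `bisect_left` returns for that phrase. The sketch below is an illustration under that assumption, and it only detects prefix matches (a DF1 phrase that appears in the middle of a DF2 phrase would be missed). `has_prefix_match` is a hypothetical helper name, not part of the original answer.

```python
from bisect import bisect_left


def has_prefix_match(sorted_phrases, phrase):
    """Return True if some entry of `sorted_phrases` starts with `phrase`."""
    ind = bisect_left(sorted_phrases, phrase)
    # If any entry starts with `phrase`, the first entry >= `phrase` does.
    return ind < len(sorted_phrases) and sorted_phrases[ind].startswith(phrase)


df2_sorted = sorted(["Little Red Corvette", "Grow Your Beans", "James Dean"])
df1_phrases = ["Little Red", "Grow Your", "James Bond", "Tom Brady"]

kept = [p for p in df1_phrases if not has_prefix_match(df2_sorted, p)]
print(kept)
```

Each lookup costs O(log N) after a one-time O(N log N) sort, which is where the complexity claim comes from.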

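Stack Overflow user

Answered 2015-09-09 06:19:51

I would start by doing an outer merge on the DataFrames. I am not sure whether DF1 in your post refers to a column name or to the name of a DataFrame variable, but for simplicity I will assume you have two DataFrames, each with a single column of strings: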

df1 
#        words
#0  little red
#1   grow your
#2  james bond
#3  tom brandy

df2 
#                 words
#0  little red corvette
#1      grow your beans
#2           james dean
#3               little

Next, create a new DataFrame that combines the two, using an outer merge. This takes care of exact duplicates.

import pandas

df3 = pandas.merge( df1, df2, on='words', how='outer')
#                 words
#0           little red
#1            grow your
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean
#7               little
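To see why the outer merge removes exact duplicates, here is a small sketch with one phrase shared by both frames. The frames `left` and `right` are made-up illustration data, and the sketch assumes each frame has no repeated phrases of its own. Row order after an outer merge can differ between pandas versions, so the comparison below sorts the values first.

```python
import pandas as pd

left = pd.DataFrame({"words": ["james bond", "james dean"]})
right = pd.DataFrame({"words": ["james dean", "little"]})

# Outer merge on the shared column: the common key appears once in the result.
merged = pd.merge(left, right, on="words", how="outer")

# An equivalent route: stack the frames, then drop repeated rows.
stacked = pd.concat([left, right], ignore_index=True).drop_duplicates(ignore_index=True)

print(sorted(merged["words"]))
print(sorted(stacked["words"]))
```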

Next, use the Series.str.get_dummies method:

dummies = df3.words.str.get_dummies(sep='')
#   grow your  grow your beans  james bond  james dean  little  little red  \
#0          0                0           0           0       1           1   
#1          1                0           0           0       0           0   
#2          0                0           1           0       0           0   
#3          0                0           0           0       0           0   
#4          0                0           0           0       1           1   
#5          1                1           0           0       0           0   
#6          0                0           0           1       0           0   
#7          0                0           0           0       1           0   

#   little red corvette  tom brandy  
#0                    0           0  
#1                    0           0  
#2                    0           0  
#3                    0           1  
#4                    1           0  
#5                    0           0  
#6                    0           0  
#7                    0           0

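Note that a column sums to 1 when its string appears only in its own row, that is, when it is not a substring of any other entry in the words column (superstrings fall into this group). Otherwise the column sums to a number greater than 1. You can now use this dummies DataFrame to find the indices of those substrings and drop them: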

from numpy import where  # needed for the positional lookup below

bad_rows = [where(df3.words==word)[0][0] 
            for word in list(dummies) 
            if dummies[word].sum() > 1 ]  # only substrings will sum to greater than 1
#[1, 7, 0]

df3.drop( df3.index[bad_rows] , inplace=True)
#                 words
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean

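Note: this handles the case where a superstring contains multiple substrings. For example, 'little' and 'little red' are both substrings of the superstring 'little red corvette', so I assumed you would want to keep only the superstring.

Votes: 1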
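The same substring bookkeeping can be sketched without get_dummies, in case `sep=''` behaves differently in your pandas version (an assumption worth checking; this is not the answer's own code). The sketch counts, for each entry, how many entries contain it as a substring and keeps those counted exactly once. It assumes the entries are unique, as they are after the outer merge, and it matches raw substrings, so 'little' would also count as contained in 'belittle'.

```python
import pandas as pd

words = pd.Series(
    [
        "little red",
        "grow your",
        "james bond",
        "tom brandy",
        "little red corvette",
        "grow your beans",
        "james dean",
        "little",
    ],
    name="words",
)

# For each entry, count the entries (itself included) that contain it.
counts = words.apply(lambda w: sum(w in other for other in words))

# A count of 1 means the entry is contained only in itself, so keep it.
result = words[counts == 1].reset_index(drop=True)
print(result)
```

This is quadratic in the number of phrases, which is fine for small frames but may be slow for large ones.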

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/32465552
