Question: I have two DataFrames and want to remove any duplicates or partial duplicates between them.
DF1              DF2
**Phrases**      **Phrases**
Little Red       Little Red Corvette
Grow Your        Grow Your Beans
James Bond       James Dean
Tom Brady
I want to remove the phrases "Little Red" and "Grow Your" from DF1 and then combine the two DataFrames, so that the final product looks like this:
DF3
Little Red Corvette
Grow Your Beans
James Bond
James Dean
Tom Brady
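Note that I only want to remove a phrase from DF1 if all of its words appear in a single phrase in DF2 (e.g. "Little Red" vs. "Little Red Corvette"). I do not want "James Bond" removed from DF1 just because "James Dean" appears in DF2.
Posted on 2015-09-09 04:22:49
I came up with the solution below. It is not very elegant, but it works.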
import pandas as pd

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

# For each phrase in df1, if some phrase in df2 starts with it,
# replace the df1 phrase with the longer df2 phrase.
# Note that the column name is 0.
for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.loc[j, 0].startswith(df1.loc[i, 0]):
            df1.loc[i, 0] = df2.loc[j, 0]

# Finally, merge df1 and df2 by the union of the keys.
# Here, too, the column name is 0.
df3 = df2.merge(df1, how='outer', on=0, sort=True)
The DataFrame df3 is exactly what you need.
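One caveat: the loop above matches on prefixes, while the question asks to drop a DF1 phrase when all of its words appear in one DF2 phrase. A minimal sketch of that word-based rule follows. It assumes the column is named Phrases (taken from the question's table header), and the helper covered_by_df2 is a hypothetical name introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Phrases": ["Little Red", "Grow Your", "James Bond", "Tom Brady"]})
df2 = pd.DataFrame({"Phrases": ["Little Red Corvette", "Grow Your Beans", "James Dean"]})

# Pre-split every DF2 phrase into a set of words.
df2_word_sets = [set(phrase.split()) for phrase in df2["Phrases"]]

def covered_by_df2(phrase):
    """Return True if every word of `phrase` occurs in one single DF2 phrase."""
    words = set(phrase.split())
    return any(words <= other for other in df2_word_sets)

# Keep only the DF1 phrases that are not fully covered, then stack them under DF2.
keep = ~df1["Phrases"].map(covered_by_df2)
df3 = pd.concat([df2, df1[keep]], ignore_index=True)
print(df3)
```

With the sample data, "James Bond" stays because no single DF2 phrase contains both "James" and "Bond". This variant ignores word order, which the question does not specify, so treat that as an assumption.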
Posted on 2015-09-09 05:22:23
You can sort the values and then bisect them:
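import pandas as pd
from bisect import bisect_left

df1 = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
df2 = pd.DataFrame(['Little Red Corvette', 'Grow Your Beans', 'James Dean'])

def find_common(df1, df2):
    """Yield the df1 phrases that are not contained in a df2 phrase."""
    vals = df2[0].tolist()  # df2 must already be sorted
    for val in df1[0]:
        # A df2 phrase that starts with val sorts at val's insertion point.
        ind = bisect_left(vals, val, hi=len(vals) - 1)
        if val not in vals[ind]:
            yield val

# Sort df2 explicitly instead of relying on an in-place sort of df2.values.
df2 = df2.sort_values(0, ignore_index=True)
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
df3 = pd.concat([df2, pd.DataFrame(list(find_common(df1, df2)))], ignore_index=True)
print(df3)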
Output:
0
0 Grow Your Beans
1 James Dean
2 Little Red Corvette
3 James Bond
4 Tom Brady
Sorting gives you an O(N log N) solution, instead of iterating over every string in df2 each time you check a string from df1.
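To show why a single bisect per lookup is enough, here is a small standalone sketch using plain Python lists rather than DataFrames. In a sorted list, any phrase that starts with a given prefix sits exactly at that prefix's insertion point, so only one neighbour needs checking. The helper name has_prefix_match is hypothetical, and note that this checks prefixes only, not arbitrary substrings.

```python
from bisect import bisect_left

phrases = sorted(["Little Red Corvette", "Grow Your Beans", "James Dean"])

def has_prefix_match(prefix, sorted_phrases):
    """Return True if some phrase in the sorted list starts with `prefix`."""
    # Index of the smallest phrase that is >= prefix in lexicographic order.
    i = bisect_left(sorted_phrases, prefix)
    # If any phrase starts with `prefix`, the one at index i does.
    return i < len(sorted_phrases) and sorted_phrases[i].startswith(prefix)

for candidate in ["Little Red", "Grow Your", "James Bond", "Tom Brady"]:
    print(candidate, has_prefix_match(candidate, phrases))
```

Each lookup costs O(log N) comparisons after the one-time O(N log N) sort.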
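Posted on 2015-09-09 06:19:51
I would start by doing an outer merge on the DataFrames. I am not sure whether DF1 in your post refers to a column name or to a DataFrame variable name, but for simplicity I will assume you have two DataFrames, each with a single column of strings: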
df1
# words
#0 little red
#1 grow your
#2 james bond
#3 tom brandy
df2
# words
#0 little red corvette
#1 grow your beans
#2 james dean
#3 little
Next, create a new DataFrame that merges the two (using an outer merge). This takes care of exact duplicates.
import pandas
df3 = pandas.merge(df1, df2, on='words', how='outer')
# words
#0 little red
#1 grow your
#2 james bond
#3 tom brandy
#4 little red corvette
#5 grow your beans
#6 james dean
#7 little
Next, use the Series.str.get_dummies method:
dummies = df3.words.str.get_dummies(sep='')
#   grow your  grow your beans  james bond  james dean  little  little red  \
#0          0                0           0           0       1           1
#1          1                0           0           0       0           0
#2          0                0           1           0       0           0
#3          0                0           0           0       0           0
#4          0                0           0           0       1           1
#5          1                1           0           0       0           0
#6          0                0           0           1       0           0
#7          0                0           0           0       1           0
#   little red corvette  tom brandy
#0                    0           0
#1                    0           0
#2                    0           0
#3                    0           1
#4                    1           0
#5                    0           0
#6                    0           0
#7                    0           0
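Note that if a string in the words column is not a substring of any other string (either it is unrelated to the others, or it is itself a superstring of one or more substrings), its column sums to 1. Otherwise it sums to a number greater than 1. You can now use this dummies DataFrame to find the row positions of those substrings and drop them: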
import numpy as np

bad_rows = [np.where(df3.words == word)[0][0]
            for word in list(dummies)
            if dummies[word].sum() > 1]  # only substrings sum to more than 1
#[1, 7, 0]
df3.drop(df3.index[bad_rows], inplace=True)
# words
#2 james bond
#3 tom brandy
#4 little red corvette
#5 grow your beans
#6 james dean
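Note: this handles the case where a superstring has multiple substrings. For example, 'little' and 'little red' are both substrings of the superstring 'little red corvette', so I assume you want to keep only the superstring.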
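The get_dummies(sep='') step depends on how an empty separator is handled, which may differ between pandas versions. As a hedged alternative, the sketch below computes the same "is this a substring of another entry" test with plain Python string containment. It assumes the same words column and sample data as above, and the helper is_substring_of_another is a hypothetical name.

```python
import pandas as pd

df1 = pd.DataFrame({"words": ["little red", "grow your", "james bond", "tom brandy"]})
df2 = pd.DataFrame({"words": ["little red corvette", "grow your beans", "james dean", "little"]})

# Stack both columns and drop exact duplicates.
words = pd.concat([df1["words"], df2["words"]], ignore_index=True).drop_duplicates()
all_words = words.tolist()

def is_substring_of_another(word):
    """Return True if `word` occurs inside some other entry of the column."""
    # Every entry contains itself once, so a count above 1 means another match.
    return sum(word in other for other in all_words) > 1

# Keep only the entries that are not contained in any other entry.
is_sub = words.map(is_substring_of_another)
df3 = words[~is_sub].to_frame()
print(df3)
```

This compares every pair of strings, so it is quadratic in the number of rows. That is fine for small inputs like the example but may be slow for large ones.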
https://stackoverflow.com/questions/32465552