I have a dataset with two address columns, and I need to compare them to check whether they contain the same number or set of numbers.
This is my dataset:
import numpy as np
import pandas as pd
data = [['Road 76', 'Road 12, 55'], ['Road 11, 7-9', 'Road 11, 5'], ['Road 25', 'Street 5']]
df_original = pd.DataFrame(data, columns=['Address 1', 'Address 2'])
This is the outcome I want:
test_data = [['Road 76', 'Road 12, 55', 0], ['7-9, Road 11', 'Road 11, 5', 1], ['Road 5', 'Street 25', 0]]
df_outcome = pd.DataFrame(test_data, columns = ['Address 1', 'Address 2', 'Number Match?'])
df_outcome
This is my attempt, but it only considers the first number that appears in each column:
# str.extract keeps only the first match in each string
df_original['Address 1'] = df_original['Address 1'].str.extract(r'(\d+)')
df_original['Address 2'] = df_original['Address 2'].str.extract(r'(\d+)')
df_original['Number match'] = np.where(df_original['Address 1']==df_original['Address 2'], 1, 0)
Any suggestions?
Posted on 2021-09-22 05:57:33
First get all the integers with Series.str.findall, convert the values to sets, use & for the intersection, and finally cast the result so that True maps to 1 and False to 0:
# findall returns every run of digits in each address; set() drops duplicates
df_original['Address 1'] = df_original['Address 1'].str.findall(r'(\d+)').apply(set)
df_original['Address 2'] = df_original['Address 2'].str.findall(r'(\d+)').apply(set)
# a non-empty intersection becomes 1, an empty one becomes 0
df_original['Number match'] = (df_original['Address 1'] & df_original['Address 2']).astype(int)
print(df_original)
    Address 1  Address 2  Number match
0        {76}   {55, 12}             0
1  {9, 7, 11}    {5, 11}             1
2        {25}        {5}             0
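If you would rather not depend on how pandas applies & to a column of sets, here is a minimal sketch that does the intersection row by row in plain Python. The helper name numbers and the frame name df are only illustrative, and it assumes every address is a string (no missing values). Unlike the code above, it leaves the original address columns untouched.

```python
import re

import pandas as pd

data = [['Road 76', 'Road 12, 55'], ['Road 11, 7-9', 'Road 11, 5'], ['Road 25', 'Street 5']]
df = pd.DataFrame(data, columns=['Address 1', 'Address 2'])


def numbers(text):
    """Return the set of digit runs found in text, as strings (illustrative helper)."""
    return set(re.findall(r'\d+', text))


# 1 when the two addresses share at least one number, otherwise 0
df['Number match'] = [
    int(bool(numbers(a) & numbers(b)))
    for a, b in zip(df['Address 1'], df['Address 2'])
]

print(df['Number match'].tolist())  # [0, 1, 0]
```

Note that a range such as 7-9 is treated as the two separate numbers 7 and 9, the same as in the findall version.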
https://stackoverflow.com/questions/69278818