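I have a DataFrame containing position information for various objects, along with a unique index for each object (the index in this example is unrelated to the DataFrame's own index). Here is some sample data: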
                 ind  pos
x    y   z
-1.0 7.0 0.0      21  [-2.76788330078, 217.786453247, 26.6822681427]
         0.0      22  [-7.23852539062, 217.274139404, 26.6758270264]
     0.0 1.0     152  [-0.868591308594, 2.48404550552, 48.4036369324]
     6.0 2.0     427  [-0.304443359375, 182.772140503, 79.4475860596]
The actual DataFrame is quite long. I have written a function that takes two vectors as input and returns the distance between them:
import numpy as N
def dist(a, b):
    # Euclidean distance between two vectors
    diff = N.array(a) - N.array(b)
    d = N.sqrt(N.dot(diff, diff))
    return d
I also have a function that, given two arrays, returns all unique combinations of elements between them:
def getPairs(a, b):
    if N.array_equal(a, b):
        # same array on both sides: take each unordered pair only once
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1, len(b))]
    else:
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    return pairs
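I want to take my DataFrame and find all pairs of elements whose distance from each other is below some value, say 30. For the pairs that meet this requirement, I also need to store the computed distance in another DataFrame. Here is my attempt at solving this, but it turns out to be very slow.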
pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind))
         for i in [0,1] for j in [0,1] for k in [0,1]
         if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]
pairs = list(itertools.chain(*pairs))
subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs
subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == @x[0]').pos[0], df_yz.query('ind == @x[1]').pos[0]))
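Don't worry about what I am iterating over in this for loop; it is specific to the system I am working with and is not where things slow down. The step where I use .query() is where the main bottleneck occurs.
The output I am looking for is something like: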
pair distance
(21, 22) 22.59
(21, 152) 15.01
(22, 427) 19.22
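I made up the distances, and the list of pairs would be much longer, but that is the basic idea.
Answered 2018-07-19 04:49:34
It took me a while, but here are some possible solutions for you. Hopefully they are self-explanatory. They were written in Python 3.x in a Jupyter Notebook. Note: if your coordinates are world coordinates (latitude/longitude), you might consider the Haversine (great-circle) distance instead of the Euclidean distance, which is a straight line.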
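For reference, the Haversine distance mentioned above could look roughly like the sketch below. It assumes the coordinates are latitude/longitude in degrees and uses a mean Earth radius of 6371 km; the function name `haversine` and the constant `EARTH_RADIUS_KM` are illustrative and not part of the question's code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres

def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    # convert degrees to radians before applying the formula
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

# a quarter of the equator: from (0, 0) to (0, 90)
quarter = haversine(0.0, 0.0, 0.0, 90.0)
same_point = haversine(10.0, 20.0, 10.0, 20.0)
```

The positions in the question look like plain 3D Cartesian coordinates, so the rest of this answer sticks with the Euclidean distance.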
First, create the data:
import pandas as pd
import numpy as np
values = [
    { 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
    { 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
    { 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
    { 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]
def dist(a, b):
    """
    Calculates the Euclidean distance between two 3D-vectors.
    """
    diff = np.array(a) - np.array(b)
    d = np.sqrt(np.dot(diff, diff))
    return d
df_initial = pd.DataFrame(values)
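The following three solutions all produce this output (the row labels in the first column may differ between them):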
pairs distance
1 (21, 22) 4.499905
3 (21, 427) 63.373886
7 (22, 427) 63.429709
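The first solution is based on a full join of the data with itself. The drawback is that it may exceed your memory if the dataset is large. The advantages are readable code and the use of Pandas only: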
#%%time
df = df_initial.copy()
# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)
# remove rows that pair an object with itself
df = df[df['ind1'] != df['ind2']].copy()
# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)
# filter only those within a specific distance
df = df[df['distance'] < 70].copy()
# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))
# select columns of interest
df = df[['pairs', 'distance']].copy()
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates, since (a, b) and (b, a) are the same pair
df = df.drop_duplicates(subset=['pairs'])
# print result
df
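If the row-wise `apply` in the self-join becomes a bottleneck, one possible variation is to let SciPy compute all pairwise distances in a single call. This is only a sketch, not a benchmarked replacement: it assumes the positions fit in memory as an (n, 3) array, and the arrays `ind` and `pos` simply repeat the sample values from above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# sample labels and positions, repeated here so the block is self-contained
ind = np.array([21, 22, 152, 427])
pos = np.array([
    [-2.76788330078, 217.786453247, 26.6822681427],
    [-7.23852539062, 217.274139404, 26.6758270264],
    [-0.868591308594, 2.48404550552, 48.4036369324],
    [-0.304443359375, 182.772140503, 79.4475860596],
])

# pdist returns the condensed distances for all i < j, in the same order
# as np.triu_indices(n, k=1), so every pair is computed exactly once
i, j = np.triu_indices(len(pos), k=1)
d = pdist(pos)

# keep only the pairs below the threshold used in this answer
mask = d < 70
result = pd.DataFrame({
    'pairs': list(zip(ind[i[mask]], ind[j[mask]])),
    'distance': d[mask],
})
print(result)
```

Because only the pairs with i < j are generated, the tuple sorting and `drop_duplicates` steps are not needed here. Memory still grows quadratically with the number of rows, so the same caveat as for the self-join applies.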
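The second solution tries to avoid the memory problem of the first version by iterating over the original data row by row, computing the distances between the current row and all rows of the original data, and keeping only the values that satisfy the distance constraint. I expected poor performance, but it was not bad at all (see the summary at the end).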
#%%time
df = df_initial.copy()
results = list()
for index, row1 in df.iterrows():
    # calculate distance between current coordinate and all original rows in the data
    df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)
    # filter only those within a specific distance and drop rows with same index as current coordinate
    df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()
    # prepare final data
    df_tmp['ind2'] = row1['ind']
    df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))
    # remember data
    results.append(df_tmp)
# combine all into one dataframe
df = pd.concat(results)
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The third solution is based on spatial operations using SciPy's KDTree.
#%%time
from scipy import spatial
tree = spatial.KDTree(list(df_initial['pos']))
# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)
# convert the sparse matrix to a coordinate (COO) representation
coo = distances.tocoo(copy=False)
def get_cell_value(idx: int, column: str = 'ind'):
    return df_initial.iloc[idx][column]
def extract_indices(row):
    # look the columns up by name so the result does not depend on column order
    return get_cell_value(int(row['idx1'])), get_cell_value(int(row['idx2']))
df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
# drop self-pairs in case zero distances are stored explicitly
df = df[df['idx1'] != df['idx2']].copy()
df['pairs'] = df.apply(extract_indices, axis=1)
# select columns of interest
df = df[['pairs', 'distance']].copy()
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df = df.drop_duplicates(subset=['pairs'])
# print result
df
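So what about performance? If you only want to know which rows of the original data lie within the desired distance, the KDTree version (the third one) is very fast: generating the sparse matrix took only 4 ms. But because I then use the indices from that matrix to pull values out of the original data, performance drops. Of course, this should be tested on your full dataset.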
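The lookup cost described above comes from calling `iloc` once per pair. One way it might be avoided is to keep the labels in a NumPy array and index it with the whole array of pair indices at once. The sketch below is untimed and makes a few assumptions: it uses `cKDTree.query_pairs` rather than the sparse distance matrix, it repeats the sample data as the arrays `ind` and `pos`, and it relies on `query_pairs` returning pairs within a distance of at most `r` (inclusive) with i < j.

```python
import numpy as np
import pandas as pd
from scipy import spatial

# sample labels and positions, repeated here so the block is self-contained
ind = np.array([21, 22, 152, 427])
pos = np.array([
    [-2.76788330078, 217.786453247, 26.6822681427],
    [-7.23852539062, 217.274139404, 26.6758270264],
    [-0.868591308594, 2.48404550552, 48.4036369324],
    [-0.304443359375, 182.772140503, 79.4475860596],
])

tree = spatial.cKDTree(pos)
# array of shape (n_pairs, 2) holding row positions (i, j) with i < j
idx = tree.query_pairs(r=70, output_type='ndarray')
i, j = idx[:, 0], idx[:, 1]

# distances for all pairs at once, no per-row Python calls
distance = np.linalg.norm(pos[i] - pos[j], axis=1)

result = pd.DataFrame({
    'pairs': list(zip(ind[i], ind[j])),  # fancy indexing maps positions to labels
    'distance': distance,
})
# the order of the returned pairs is not guaranteed, so sort for a stable result
result = result.sort_values('distance').reset_index(drop=True)
print(result)
```

Since each pair is returned only once, no tuple sorting or duplicate removal is needed. As with the other variants, this should be measured on the full dataset before drawing conclusions.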
https://stackoverflow.com/questions/51409927