Python: selecting pairs of objects from a DataFrame

Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.


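I have a DataFrame that contains position information for various objects, along with a unique index for each object (in this case, that index has nothing to do with the DataFrame's own index). Here is some sample data: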

                     ind    pos
   x    y    z      
-1.0    7.0  0.0      21    [-2.76788330078, 217.786453247, 26.6822681427]
             0.0      22    [-7.23852539062, 217.274139404, 26.6758270264]
        0.0  1.0      152   [-0.868591308594, 2.48404550552, 48.4036369324]
        6.0  2.0      427   [-0.304443359375, 182.772140503, 79.4475860596]

The actual DataFrame is much longer. I wrote a function that takes two vectors as input and returns the distance between them:

import numpy as N

def dist(a, b):
    # Euclidean distance between two vectors
    diff = N.array(a) - N.array(b)
    d = N.sqrt(N.dot(diff, diff))
    return d

and a function that, given two arrays, returns all unique combinations of elements between those arrays:

def getPairs(a, b):
    if N.array_equal(a, b):
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1, len(b))]
    else:
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    return pairs
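
The same pairing logic can be sketched with the standard library's itertools. This is only an illustration, assuming the inputs are plain sequences of labels; `get_pairs` is a hypothetical name and not part of the original code.

```python
from itertools import combinations, product

def get_pairs(a, b):
    """Unique pairs within one group, or all cross pairs between two groups."""
    a, b = list(a), list(b)
    if a == b:
        # same group: each unordered pair once, never an element with itself
        return list(combinations(a, 2))
    # different groups: every element of a with every element of b
    return list(product(a, b))

print(get_pairs([21, 22, 152], [21, 22, 152]))
print(get_pairs([21, 22], [152, 427]))
```

For identical inputs this yields the same pairs, in the same order, as the index-based list comprehension above.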

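I want to take my DataFrame and find all pairs of elements whose distance from each other is below some value, say 30. For the pairs that meet this requirement, I also need to store the computed distance in another DataFrame. Here is my attempt at solving this, but it turned out to be very slow.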

pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind)) \
    for i in [0,1] for j in [0,1] for k in [0,1] if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]



pairs = list(itertools.chain(*pairs))

subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs

subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == @x[0]').pos[0], df_yz.query('ind == @x[1]').pos[0]))

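Don't worry about what I am iterating over in this for loop; it works for the system I am dealing with and is not where things slow down. The step that uses .query() is where the main bottleneck occurs.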
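
One way the repeated .query() calls might be avoided is to build a lookup from `ind` to position once and reuse it for every pair. The sketch below is not the asker's code: `df_yz` is rebuilt here as a small stand-in with the `ind` and `pos` columns from the sample data, and `pos_by_ind` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df_yz, using two rows of the sample data
df_yz = pd.DataFrame({
    'ind': [21, 22],
    'pos': [[-2.76788330078, 217.786453247, 26.6822681427],
            [-7.23852539062, 217.274139404, 26.6758270264]],
})
pairs = [(21, 22)]

# build the lookup once instead of calling .query() twice per pair
pos_by_ind = {ind: np.asarray(pos) for ind, pos in zip(df_yz['ind'], df_yz['pos'])}

subInfo = pd.DataFrame({'pairs': pairs})
# Euclidean distance for each pair via a plain dictionary lookup
subInfo['r'] = [float(np.linalg.norm(pos_by_ind[i] - pos_by_ind[j])) for i, j in pairs]
print(subInfo)
```

A dictionary lookup takes roughly constant time per pair, whereas each .query() call scans the whole DataFrame.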

The output I am looking for looks like this:

pair          distance
(21, 22)      22.59
(21, 152)     15.01
(22, 427)     19.22

I made these distances up, and the list of pairs would be longer, but that is the basic idea.

Answer

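It took me a while, but here are some possible solutions. I hope they are self-explanatory. They were written in Python 3.x in a Jupyter Notebook. One remark: if your coordinates are world coordinates, you may want to use the haversine distance (great-circle distance) instead of the Euclidean distance, which is a straight line.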
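
For reference, a minimal sketch of the haversine distance mentioned above. It assumes points given as (latitude, longitude) in degrees and a mean Earth radius of 6371 km; neither assumption comes from the question, whose positions are 3D vectors, and `haversine_km` is a hypothetical name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # haversine formula: h is the haversine of the central angle
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# one degree of longitude along the equator
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```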

First, create your data:

import pandas as pd
import numpy as np

values = [
    { 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
    { 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
    { 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
    { 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]

def dist(a, b):
    """
    Calculates the Euclidean distance between two 3D-vectors.
    """
    diff = np.array(a) - np.array(b)    
    d = np.sqrt(np.dot(diff, diff))
    return d


df_initial = pd.DataFrame(values)

Each of the three solutions below produces this output:

    pairs   distance
1   (21, 22)    4.499905
3   (21, 427)   63.373886
7   (22, 427)   63.429709

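The first solution is based on a full join of the data with itself. The downside is that it may exceed your memory if the dataset is large. The upside is that the code is easy to read and uses only pandas: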

#%%time 

df = df_initial.copy()

# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)

# remove rows with similar index
df = df[df['ind1'] != df['ind2']]

# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)

# filter only those within a specific distance
df = df[df['distance'] < 70]

# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df
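
A possible variation on this first solution, sketched under the assumption that pandas 1.2 or newer is available (for `how='cross'`): keeping only rows with `ind1 < ind2` removes self-pairs and mirrored duplicates in one step, so the tuple sorting and drop_duplicates are not needed. The sample data is repeated to keep the snippet self-contained, and `result` is a hypothetical name.

```python
import numpy as np
import pandas as pd

df_initial = pd.DataFrame({
    'ind': [21, 22, 152, 427],
    'pos': [[-2.76788330078, 217.786453247, 26.6822681427],
            [-7.23852539062, 217.274139404, 26.6758270264],
            [-0.868591308594, 2.48404550552, 48.4036369324],
            [-0.304443359375, 182.772140503, 79.4475860596]],
})

# cross join of the data with itself (requires pandas >= 1.2)
df = df_initial.merge(df_initial, how='cross', suffixes=('1', '2'))

# keep each unordered pair once and drop self-pairs
df = df[df['ind1'] < df['ind2']].copy()

# Euclidean distance on stacked arrays instead of a row-wise apply
p1 = np.array(df['pos1'].tolist())
p2 = np.array(df['pos2'].tolist())
df['distance'] = np.linalg.norm(p1 - p2, axis=1)
df['pairs'] = list(zip(df['ind1'], df['ind2']))

# filter only those within a specific distance
result = df.loc[df['distance'] < 70, ['pairs', 'distance']]
print(result)
```

The memory concern of the full join still applies, since the cross join materialises all n × n rows before filtering.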

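The second solution tries to avoid the memory problem of the first version. It iterates over the original data row by row, computes the distance between the current row and every row of the original data, and keeps only the values that satisfy the distance constraint. I expected poor performance, but it was not bad at all (see the summary at the end).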

#%%time

df = df_initial.copy()

results = list()
for index, row1 in df.iterrows():
    # calculate distance between current coordinate and all original rows in the data
    df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)

    # filter only those within a specific distance and drop rows with same index as current coordinate
    df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()

    # prepare final data
    df_tmp['ind2'] = row1['ind']
    df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))

    # remember data
    results.append(df_tmp)

# combine all into one dataframe
df = pd.concat(results)

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df
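
The row-by-row idea can also be sketched with NumPy doing the inner distance computation. This is an illustrative variant rather than the code timed below: it compares each row only with the rows after it, so every pair appears once and only one row of distances is held in memory at a time. The sample data is repeated for self-containment, and `result` is a hypothetical name.

```python
import numpy as np
import pandas as pd

df_initial = pd.DataFrame({
    'ind': [21, 22, 152, 427],
    'pos': [[-2.76788330078, 217.786453247, 26.6822681427],
            [-7.23852539062, 217.274139404, 26.6758270264],
            [-0.868591308594, 2.48404550552, 48.4036369324],
            [-0.304443359375, 182.772140503, 79.4475860596]],
})

ind = df_initial['ind'].to_numpy()
pos = np.array(df_initial['pos'].tolist())  # shape (n, 3)

pairs, distances = [], []
for i in range(len(pos) - 1):
    # distances from row i to the rows after it only
    d = np.linalg.norm(pos[i + 1:] - pos[i], axis=1)
    for j in np.flatnonzero(d < 70):
        # order each pair from low to high, as sort_tuple does
        pairs.append(tuple(sorted((int(ind[i]), int(ind[i + 1 + j])))))
        distances.append(float(d[j]))

result = pd.DataFrame({'pairs': pairs, 'distance': distances})
print(result)
```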

The third solution uses SciPy's KDTree for spatial operations.

#%%time
from scipy import spatial

tree = spatial.KDTree(list(df_initial['pos']))

# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)

# convert to coordinate (COO) format to read the row indices, column indices, and distances
coo = distances.tocoo(copy=False)

def get_cell_value(idx: int, column: str = 'ind'):
    return df_initial.iloc[idx][column]

def extract_indices(row):
    # access columns by name; positional unpacking depends on the column order
    return get_cell_value(int(row['idx1'])), get_cell_value(int(row['idx2']))

df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
df['pairs'] = df.apply(extract_indices, axis=1)

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df

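What about performance? If you only want to know which rows of the original data lie within the desired distance, the KDTree version (the third one) is very fast: generating the sparse matrix takes only 4 ms. However, because I then use the indices from that matrix to pull values out of the original data, performance drops. Of course, this should be tested on your full dataset.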

  • Version 1: 93.4 ms
  • Version 2: 42.2 ms
  • Version 3: 52.3 ms (4 ms for the sparse matrix alone)
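
The index lookup that slows down version 3 might be reduced by mapping matrix indices to `ind` values with NumPy array indexing instead of a row-wise apply. The sketch below is untimed and rests on assumptions: it uses `scipy.spatial.cKDTree`, repeats the sample data, and keeps only entries with row < column to drop self-pairs and mirrored duplicates. `result` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy import spatial

df_initial = pd.DataFrame({
    'ind': [21, 22, 152, 427],
    'pos': [[-2.76788330078, 217.786453247, 26.6822681427],
            [-7.23852539062, 217.274139404, 26.6758270264],
            [-0.868591308594, 2.48404550552, 48.4036369324],
            [-0.304443359375, 182.772140503, 79.4475860596]],
})

tree = spatial.cKDTree(np.array(df_initial['pos'].tolist()))
coo = tree.sparse_distance_matrix(tree, max_distance=70).tocoo()

# keep each unordered pair once; this also drops any self-pairs
mask = coo.row < coo.col
ind = df_initial['ind'].to_numpy()
a, b = ind[coo.row[mask]], ind[coo.col[mask]]

result = pd.DataFrame({
    # order each pair from low to high without a per-row Python function
    'pairs': list(zip(np.minimum(a, b).tolist(), np.maximum(a, b).tolist())),
    'distance': coo.data[mask],
})
print(result)
```

The row order of the result follows the sparse matrix and is not guaranteed to be sorted.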
