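I have two DataFrames. One contains several power plants with their locations given as longitude and latitude, each in its own column. The other DataFrame contains several substations, also with longitude and latitude. What I want to do is assign each power plant to its nearest substation.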
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
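I think I need to compute the distances between all points and then group the DataFrames, but I am not sure how to go about it. I found the numpy.linalg.norm() function, but it does not fit my case. Any help is greatly appreciated.
I found this solution, which is basically what I need: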
import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: power plants
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: substations
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    print("A", i, "is closest to B", min_index, min_distance, "km")
It works, but it takes forever, and my dataset is very large. I think an approach using .apply might be faster. Does anyone know how to rewrite this with apply?
Posted on 2021-11-03 20:58:59
Here is a solution using geopandas. I don't know how well it scales to larger datasets.
import geopandas as gpd
import pandas as pd
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
# create GeoDataFrames from the original dfs
gdf1 = gpd.GeoDataFrame(df1[['ID_pp']], geometry=gpd.points_from_xy(df1['x'], df1['y']), crs='EPSG:4326')
gdf2 = gpd.GeoDataFrame(df2[['ID_ss']], geometry=gpd.points_from_xy(df2['x'], df2['y']), crs='EPSG:4326')
# convert to another coordinate reference system for units in metres, EPSG:5243 suits Germany as far as I know
gdf1 = gdf1.to_crs('EPSG:5243')
gdf2 = gdf2.to_crs('EPSG:5243')
gdf2 = gdf2.set_index('ID_ss')
def get_closest_ss(point, other):
    # distance from this point to every geometry in `other`, indexed by ID_ss
    s = other.distance(point)
    return (s.idxmin(), s.min())
# find ID of closest substation to all power plants
gdf1[['closest_ss', 'distance']] = gdf1.geometry.apply(get_closest_ss, args=(gdf2,)).to_list()
# merge the dataframe with the power plants (gdf1) with the closest substation (gdf2)
gdf = gdf1.merge(gdf2, left_on='closest_ss', right_index=True, suffixes=('', '_ss'))
print(gdf)
# output
  ID_pp                         geometry closest_ss       distance  \
0    p1   POINT (159807.847 -320153.333)         s4  717896.945731
1    p2    POINT (79356.344 -330713.037)         s4  711534.096071
2    p3   POINT (159807.847 -320153.333)         s4  717896.945731
3    p4  POINT (-171106.060 -202478.708)         s4  592470.679838

                     geometry_ss
0  POINT (-28563.516 372589.227)
1  POINT (-28563.516 372589.227)
2  POINT (-28563.516 372589.227)
3  POINT (-28563.516 372589.227)
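If the row-by-row loop or the per-point apply turns out to be too slow, one alternative is to compute every plant-to-substation distance at once with NumPy broadcasting. The following is only a sketch under a few assumptions: `x` is longitude and `y` is latitude in degrees, the Earth is treated as a sphere (haversine formula), so the distances will differ slightly from geodesic or projected ones, and the helper `nearest_by_haversine` and the column `distance_km` are names made up for this illustration. It builds a full len(df1) × len(df2) matrix, so very large inputs would probably need to be processed in chunks.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def nearest_by_haversine(left, right, right_id):
    """Hypothetical helper: for each row of `left`, find the nearest row of `right`.

    Both frames are assumed to have 'x' (longitude) and 'y' (latitude) in degrees.
    """
    lon1 = np.radians(left['x'].to_numpy())[:, None]   # shape (n, 1)
    lat1 = np.radians(left['y'].to_numpy())[:, None]
    lon2 = np.radians(right['x'].to_numpy())[None, :]  # shape (1, m)
    lat2 = np.radians(right['y'].to_numpy())[None, :]

    # Haversine formula, broadcast to an (n, m) distance matrix
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    idx = dist_km.argmin(axis=1)  # column index of the nearest row in `right`
    out = left.copy()
    out['closest_ss'] = right[right_id].to_numpy()[idx]
    out['distance_km'] = dist_km[np.arange(len(left)), idx]
    return out


df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

result = nearest_by_haversine(df1, df2, 'ID_ss')
print(result[['ID_pp', 'closest_ss', 'distance_km']])
```

On the sample data this should pick the same substation for every plant as the geopandas version above, with distances in kilometres rather than metres.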
https://stackoverflow.com/questions/69822240