我有两个数据集。Dataset %1包含具有range start
、range end
和variable Y
的列。Dataset 2包含具有range start
、range end
、variable A
、variable B
和variable C
的列。
我想要比较两个数据集中的两组范围,并创建一个新的数据集,该数据集具有两列的范围重叠(即开始范围重叠、结束范围重叠)以及此重叠的所有变量(即variable Y
、variable A
、variable B
、variable C
)
我是R的新手,所以我有点困惑,不知道如何去做,甚至不知道如何正确地解释它,但这里有一个我认为可以解释它的例子。
Dataset 1:
Start range (96.98, 97.02, 97.06)
End range (97.005, 97.05, 97.095)
Variable Y (1.48, 0.42, 4.78)
Dataset 2:
start range(96.95, 97, 97.05)
end range(97, 97.05, 97.1)
Variable A (100, 50, 10)
Variable B (0, 30, 30)
New Dataset 3:
Start range (96.95, 96.98, 97, 97.005, 97.02, 97.05, 97.06, 97.095)
end range (96.98, 97, 97.005, 97.02, 97.05, 97.06, 97.095, 97.1)
Variable Y (NA, 1.48, 1.48, NA, 0.42, NA, 4.78, NA)
Variable A (100, 100, 50, 50, 50, 10, 10, 10)
Variable B (0, 0, 30, 30, 30, 30, 30, 30)
*注意,NA不是值-在这种情况下,我仍然希望包含不重叠的列。
发布于 2019-07-15 09:22:17
如果您只想要重叠的范围,这将很容易:例如,可以使用sqldf
将其编写为SQL join。
library(sqldf)
sqldf("
SELECT MAX(d1.start, d2.start) AS start,
MIN(d1.end, d2.end) AS end,
d1.start AS start1,
d1.end AS end1,
d2.start AS start2,
d2.end AS end2,
d1.Y, d2.A, d2.B, d2.C
FROM d1, d2
WHERE d1.start <= d2.end AND d2.start <= d1.end
")
如果您还想要在其上没有重叠的区间,则需要更复杂;特别是,给定的区间可以有几个没有重叠的子区间。一种解决方案是首先通过收集所有端点来计算所有这些子间隔。
dates <- sort( unique( c( d1$start, d1$end, d2$start, d2$end ) ) )
d <- data.frame(
start = dates[-length(dates)],
end = dates[-1]
)
t1 <- sqldf("
SELECT d.start, d.end, d1.Y
FROM d LEFT JOIN d1
ON MAX(d.start, d1.start) < MIN(d.end, d1.end)
")
t2 <- sqldf("
SELECT d.start, d.end, d2.A, d2.B, d2.C
FROM d LEFT JOIN d2
ON MAX(d.start, d2.start) < MIN(d.end, d2.end)
")
sqldf( "SELECT * FROM t1 JOIN t2 USING (start, end)" )
使用的样本数据:
d1 <- data.frame(
start = c(96.98, 97.02, 97.06),
end = c(97.005, 97.05, 97.095),
Y = c(1.48, 0.42, 4.78)
)
d2 <- data.frame(
start = c(96.95, 97, 97.05),
end = c(97, 97.05, 97.1),
A = c(100,0,0),
B = c(0,0,0),
C = c(0,100,100)
)
https://stackoverflow.com/questions/57032115
复制相似问题