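My team and I are working with thousands of URLs that share similar segments. Some URLs have one segment ("seg", plural "segs") at a position we are interested in. Other, similar URLs have different segs at that position. We need to summarize a data frame of URLs and their associated segs so that it shows how often each unique seg occurs for each URL.
Here is a simplified example: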
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1   3    a    in other words: url #1 appears three times, each with seg = "a",
2   2    b    url #2 appears twice, each with seg = "b",
3   3    c    url #3 appears three times with seg = "c",
3   2    x    two times with seg = "x", and
3   1    y    once with seg = "y",
4   1    d    etc.
I can get there with a loop and several small steps, but I am sure there is a more elegant way to do it. Here is my inelegant approach:
# Create a placeholder data frame with three columns (url, Freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
# Determine the unique URLs
unique.df.url <- unique(df$url)
# Loop over the unique URLs
for (xx in unique.df.url) {
  url.seg <- df[df$url == xx, ]              # rows for this URL (xx is the URL value itself, not an index)
  freq.df.url <- data.frame(table(url.seg))  # frequency of each seg for this URL
  result <- rbind(result, freq.df.url)       # append the new data frame to the running result
}
# Drop the rows with Freq equal to 0 (including the placeholder row)
result.freq <- result[result$Freq != 0, ]
# Sort the data frame by URL
result.order <- result.freq[order(result.freq$url), ]
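This produces the expected result, but because it is so inelegant, I worry that the run time will become prohibitive, or at least a concern, once we move to scale. Any suggestions?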
Answered on 2018-05-24 19:06:38
In base R, you can do this:
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The `$<-` trick simply adds a column freq filled with the value 1, without modifying the source table.
Another possibility:
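To make the claim about the source table concrete, here is a minimal sketch using the toy data from the question. The names with_freq and res are illustrative only and do not appear in the original answer.

```r
# Toy data from the question
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Calling `$<-` as a function returns a modified copy; df itself is never reassigned
with_freq <- `$<-`(df, freq, 1)   # with_freq is a hypothetical name

names(df)         # still only url and seg
names(with_freq)  # url, seg, plus the helper column freq

# Summing the helper column per (seg, url) pair gives the counts
res <- aggregate(freq ~ seg + url, with_freq, sum)
res
```

Because the helper column lives only in the temporary copy, nothing needs to be cleaned up in df afterwards.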
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to swap the column order so that table sorts the result in the desired way.
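Why the column order matters can be seen by comparing both orders side by side. This is a small sketch on the question's toy data; by_url and by_seg are hypothetical names introduced for the comparison.

```r
# Toy data from the question
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# as.data.frame() on a table varies the FIRST dimension fastest,
# so the LAST dimension ends up as the outermost grouping.
by_url <- subset(as.data.frame(table(df[2:1])), Freq != 0)  # seg first, url last
by_seg <- subset(as.data.frame(table(df[1:2])), Freq != 0)  # url first, seg last

as.character(by_url$url)  # "1" "2" "3" "3" "3" "4": rows grouped by url
as.character(by_seg$url)  # "1" "2" "3" "4" "3" "3": rows grouped by seg instead
```

With the original column order the rows come out grouped by seg, which is why the answer reverses the columns before calling table.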
Answered on 2018-05-24 00:17:29
Would the code below work better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Answered on 2018-05-24 00:22:19
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
https://stackoverflow.com/questions/50492862