我有两个我正在处理的数据集。第一个是:
data_1 <- tribble(
~shop_name, ~sub_category,
"A", "Blu-ray, DVDs, CD",
"B", "Sneakers, Make-up, Blu-ray",
"C", "Camera, Optic, DVDs",
"D", "Flower, Notebooks, Make-up",
)第二个是:
data_2 <- tribble(
~sub_category, ~main_category,
"Blu-ray", "Electronic",
"DVDs", "Electronic",
"CD", "Electronic",
"Sneakers", "Fashion",
"Make-up", "Fashion",
"Camera", "Electronic",
"Optic", "Health",
"Flower", "Home",
)现在,我想执行左连接来添加data_1中的主类别。最终的数据应该如下所示:
merged_data <- tribble(
~shop_name, ~sub_category, ~main_category,
"A", "Blu-ray, DVDs, CD", "Electronic, Electronic, Electronic",
"B", "Sneakers, Make-up, Blu-ray", "Fashion, Fashion, Electronic",
"C", "Camera, Optic", "Electronic, Health",
"D", "Flower", "Home"
)我的代码如下:
data3 <- left_join(data_1, data_2, by = "sub_category")但不知何故,main_category返回了NA。有人能帮帮我吗?提前谢谢。
发布于 2019-10-30 17:32:11
以下是两个记录在案的data.table解决方案:
代码
您可以直接将data_1的subcategory中的每个字符串与data_2中对应的main_category进行匹配
require(data.table); setDT(data_1); setDT(data_2)
data_1[, main_category := sapply(sub_category, function(x){
str = unlist(strsplit(x, ', '))
match = as.numeric(sapply(str, function(x) data_2[, which(x == sub_category)]))
data_2[match, paste(main_category, collapse = ', ')]
})]或者,将data_1转换为长格式并在sub_category上与data_2连接
data_1 = data_1[, .(sub_category = unlist(strsplit(sub_category, ', '))), keyby = shop_name] # data_1 to long format
dt_final = merge(data_1, data_2, by = 'sub_category', all = T) # Join data_1 and data_2 on sub_category
dt_final = dt_final[, lapply(.SD, function(x) paste(x, collapse = ', ')), keyby = shop_name]结果
> data_1
shop_name sub_category main_category
1: A Blu-ray, DVDs, CD Electronic, Electronic, Electronic
2: B Sneakers, Make-up, Blu-ray Fashion, Fashion, Electronic
3: C Camera, Optic, DVDs Electronic, Health, Electronic
4: D Flower, Notebooks, Make-up Home, NA, Fashion
> dt_final
shop_name sub_category main_category
1: A Blu-ray, CD, DVDs Electronic, Electronic, Electronic
2: B Blu-ray, Make-up, Sneakers Electronic, Fashion, Fashion
3: C Camera, DVDs, Optic Electronic, Electronic, Health
4: D Flower, Make-up, Notebooks Home, Fashion, NA发布于 2019-10-30 17:17:14
基本上,您需要从data_1中拆分子类别字符串,然后加入,即
data_1 %>%
separate_rows(sub_category, sep = ', ') %>%
left_join(data_2, by = 'sub_category') %>%
group_by(shop_name) %>%
summarise_all(funs(toString))这给了我们
A tibble: 4 x 3 shop_name sub_category main_category 1 A蓝光,DVD,CD电子,电子,电子2B运动鞋,化妆,蓝光时尚,时尚,电子3C相机,光学,DVD电子,健康电子4 D花,笔记本电脑,化妆家居,北美,时尚
如果您有更多的列,则需要用summarise_at(vars(contains('category')), funs(toString))替换summarise_all
https://stackoverflow.com/questions/58621721
复制相似问题