首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >在两个数据集之间匹配列表中的值

在两个数据集之间匹配列表中的值
EN

Stack Overflow用户
提问于 2019-10-30 17:01:07
回答 2查看 45关注 0票数 0

我有两个我正在处理的数据集。第一个是:

代码语言:javascript
运行
复制
data_1 <- tribble(
  ~shop_name, ~sub_category,
  "A",        "Blu-ray, DVDs, CD",
  "B",        "Sneakers, Make-up, Blu-ray",         
  "C",        "Camera, Optic, DVDs",
  "D",        "Flower, Notebooks, Make-up", 
)

第二个是:

代码语言:javascript
运行
复制
data_2 <- tribble(
  ~sub_category, ~main_category,
  "Blu-ray",      "Electronic",
  "DVDs",         "Electronic",        
  "CD",           "Electronic",
  "Sneakers",     "Fashion",
  "Make-up",      "Fashion", 
  "Camera",       "Electronic",
  "Optic",        "Health", 
  "Flower",       "Home",
)

现在,我想执行左连接来添加data_1中的主类别。最终的数据应该如下所示:

代码语言:javascript
运行
复制
merged_data <- tribble(
  ~shop_name, ~sub_category,                 ~main_category,
  "A",        "Blu-ray, DVDs, CD",            "Electronic,  Electronic,  Electronic",
  "B",        "Sneakers, Make-up, Blu-ray",   "Fashion,  Fashion,  Electronic",      
  "C",        "Camera, Optic",                "Electronic, Health",
  "D",        "Flower",                       "Home"
)

我的代码如下:

代码语言:javascript
运行
复制
data3 <- left_join(data_1, data_2, by = "sub_category")

但不知何故,main_category返回了NA。有人能帮帮我吗?提前谢谢。

EN

回答 2

Stack Overflow用户

发布于 2019-10-30 17:32:11

以下是两个记录在案的data.table解决方案:

代码

您可以直接将data_1subcategory中的每个字符串与data_2中对应的main_category进行匹配

代码语言:javascript
运行
复制
require(data.table); setDT(data_1); setDT(data_2)

data_1[, main_category := sapply(sub_category, function(x){

  str = unlist(strsplit(x, ', '))
  match = as.numeric(sapply(str, function(x) data_2[, which(x == sub_category)]))
  data_2[match, paste(main_category, collapse = ', ')]

})]

或者,将data_1转换为长格式并在sub_category上与data_2连接

代码语言:javascript
运行
复制
data_1 = data_1[, .(sub_category = unlist(strsplit(sub_category, ', '))), keyby = shop_name] # data_1 to long format
dt_final = merge(data_1, data_2, by = 'sub_category', all = T) # Join data_1 and data_2 on sub_category
dt_final = dt_final[, lapply(.SD, function(x) paste(x, collapse = ', ')), keyby = shop_name]

结果

代码语言:javascript
运行
复制
> data_1
   shop_name               sub_category                      main_category
1:         A          Blu-ray, DVDs, CD Electronic, Electronic, Electronic
2:         B Sneakers, Make-up, Blu-ray       Fashion, Fashion, Electronic
3:         C        Camera, Optic, DVDs     Electronic, Health, Electronic
4:         D Flower, Notebooks, Make-up                  Home, NA, Fashion

> dt_final
   shop_name               sub_category                      main_category
1:         A          Blu-ray, CD, DVDs Electronic, Electronic, Electronic
2:         B Blu-ray, Make-up, Sneakers       Electronic, Fashion, Fashion
3:         C        Camera, DVDs, Optic     Electronic, Electronic, Health
4:         D Flower, Make-up, Notebooks                  Home, Fashion, NA
票数 1
EN

Stack Overflow用户

发布于 2019-10-30 17:17:14

基本上,您需要从data_1中拆分子类别字符串,然后加入,即

代码语言:javascript
运行
复制
data_1 %>% 
 separate_rows(sub_category, sep = ', ') %>% 
 left_join(data_2, by = 'sub_category') %>% 
 group_by(shop_name) %>% 
 summarise_all(funs(toString))

这给了我们

A tibble: 4 x 3 shop_name sub_category main_category 1 A蓝光,DVD,CD电子,电子,电子2B运动鞋,化妆,蓝光时尚,时尚,电子3C相机,光学,DVD电子,健康电子4 D花,笔记本电脑,化妆家居,北美,时尚

如果您有更多的列,则需要用summarise_at(vars(contains('category')), funs(toString))替换summarise_all

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/58621721

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档