How do I join data using fuzzy matching in R?

Stack Overflow user
Asked 2019-07-05 19:19:52
Answers: 1, Views: 82, Followers: 0, Votes: 1

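I have some subject and licensure data, and I want to create a column that flags whether a given licensure is appropriate for the subjects listed. The added challenge is that some teachers teach several subjects, separated by semicolons, and each licensure has several acceptable subjects.

I think I need to incorporate something like grep, but I am not sure how to apply it while joining the data from the two tables.

Sample code

Here is a sample of my data: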

df1 <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students", 
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts", 
"Spanish Language Arts; I teach all subjects for my students",
 "Math", "Science;Social Studies;Mathematics;English Language Arts", "ELA", 
"English Language Arts"), 
Licensure = c("Content Area - Early Childhood (preK-Grade 3)", 
"Core Subjects (Grades EC-6) 1770", "Mathematics (Grades 7-12) 1706", 
"English Language Arts and Reading (Grades 7-12) 1709", "Core Subjects (Grades EC-6) 1770", 
"English Language Arts and Reading (Grades 7-12) 1709", 
"English Language Arts and Reading (Grades 7-12) 1709", 
"Content Area - Elementary Education (Grades 1-6)", 
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"))

Here is the list I created, containing every licensure with its acceptable subjects underneath:

lic.subject_index <- list(
  "Content Area - Early Childhood (preK-Grade 3)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "Content Area - Elementary Education (Grades 1-6)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "Core Subjects (Grades EC-6) 1770" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "English Language Arts and Reading (Grades 7-12) 1709" = c("ELA", "English Language Arts", "Language Arts"),
  "Mathematics (Grades 7-12) 1706" = c("Math", "Mathematics")
)

What I would like to do is create a column that flags whether each subject/licensure combination is acceptable:

ideal.df <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students", 
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts", 
"Spanish Language Arts; I teach all subjects for my students", "Math", 
"Science;Social Studies;Mathematics;English Language Arts", "ELA", "English Language Arts"), 
Licensure = c("Content Area - Early Childhood (preK-Grade 3)", "Core Subjects (Grades EC-6) 1770", 
"Mathematics (Grades 7-12) 1706", "English Language Arts and Reading (Grades 7-12) 1709", 
"Core Subjects (Grades EC-6) 1770", "English Language Arts and Reading (Grades 7-12) 1709", 
"English Language Arts and Reading (Grades 7-12) 1709", "Content Area - Elementary Education (Grades 1-6)", 
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"), 
flag = c("True", "True", "True", "True", "True", "False", "False", "True", "False", "True"))

Thank you for any help you can provide!


1 Answer

Stack Overflow user

Accepted answer

Posted 2019-07-05 19:46:31

Here is one option using tidyverse and fuzzyjoin:

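library(fuzzyjoin)
library(tidyverse)

# Long lookup table: one row per (Licensure, acceptable Subject) pair
lic_tbl <- enframe(lic.subject_index, name = 'Licensure', value = 'Subject') %>%
  unnest(Subject)

out <- df1 %>%
  rownames_to_column('rn') %>%            # remember the original row
  separate_rows(Subject, sep = ';') %>%   # one row per listed subject
  # fuzzy join on both shared columns, so the result has .x/.y suffixes
  stringdist_left_join(lic_tbl, by = c('Licensure', 'Subject')) %>%
  group_by(rn = as.integer(rn)) %>%
  summarise(ind = any(!is.na(Licensure.y))) %>%  # TRUE if any subject matched
  ungroup() %>%
  pull(ind) %>%
  mutate(df1, flag = .)
out$flag
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE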
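
To see why a string-distance join tolerates the stray leading space left by splitting on `;` yet still rejects "Spanish Language Arts", it can help to look at the edit distances directly. The following is only an illustrative sketch: it uses base R's `adist()` (Levenshtein distance) as a stand-in for the distance used inside `stringdist_left_join`, and it assumes the join's default threshold of `max_dist = 2`. The variable names are hypothetical.

```r
# Illustrative only: base R's adist() returns the Levenshtein edit distance.
# Assumption: stringdist_left_join keeps pairs whose distance is <= 2 (its
# default max_dist); for these examples the exact metric does not matter.

# " Science" (left over after splitting "Math; Science") vs "Science"
d_space <- adist(" Science", "Science")[1, 1]

# "Spanish Language Arts" vs the acceptable subject "Language Arts"
d_spanish <- adist("Spanish Language Arts", "Language Arts")[1, 1]

# "Math" vs "Mathematics": far apart, which is presumably why both spellings
# are listed in lic.subject_index
d_abbrev <- adist("Math", "Mathematics")[1, 1]

c(space = d_space, spanish = d_spanish, abbrev = d_abbrev)
#   space spanish  abbrev
#       1       8       7
```

Only the first pair falls within a distance of 2, so the fuzzy join here is mostly absorbing whitespace and small typos rather than matching abbreviations.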

Checking against the OP's ideal output:

as.logical(ideal.df$flag)
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
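
If fuzzy matching is not actually needed, an exact-match variant in base R may be enough. The sketch below is not part of the accepted answer: `subjects_allowed`, `toy_index`, and `toy_df` are hypothetical names, and the toy data is a small subset modeled on the question's `df1` and `lic.subject_index`. It assumes that trimming whitespace around each `;`-separated subject is sufficient normalization.

```r
# Hypothetical helper: TRUE if any listed subject is acceptable for the licensure.
subjects_allowed <- function(subject, licensure, index) {
  # split on ";" and trim the spaces that follow some separators
  parts <- trimws(strsplit(subject, ";", fixed = TRUE)[[1]])
  # an unknown licensure yields NULL, so the result is FALSE
  any(parts %in% index[[licensure]])
}

# Toy subset of the question's lookup list (assumed data, for illustration)
toy_index <- list(
  "Mathematics (Grades 7-12) 1706" = c("Math", "Mathematics"),
  "English Language Arts and Reading (Grades 7-12) 1709" =
    c("ELA", "English Language Arts", "Language Arts")
)

toy_df <- data.frame(
  Subject = c("Mathematics; ELA", "ELA",
              "Spanish Language Arts; I teach all subjects for my students",
              "Math"),
  Licensure = c("Mathematics (Grades 7-12) 1706",
                "Mathematics (Grades 7-12) 1706",
                "English Language Arts and Reading (Grades 7-12) 1709",
                "English Language Arts and Reading (Grades 7-12) 1709"),
  stringsAsFactors = FALSE
)

# Apply the helper row by row
toy_df$flag <- mapply(subjects_allowed, toy_df$Subject, toy_df$Licensure,
                      MoreArgs = list(index = toy_index), USE.NAMES = FALSE)
toy_df$flag
#[1]  TRUE FALSE FALSE FALSE
```

Unlike the string-distance join, this variant will not forgive typos in the subject names, so it trades tolerance for predictability.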
Votes: 4
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/56908276
