
Deduplicating on multiple conditions

Stack Overflow user
Asked 2017-11-13 23:22:18
Answers: 3 · Views: 1.1K · Followers: 0 · Votes: 0

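I have a data set in which a person (Name) appears more than once within a single Eggphase category. I want only one sample per person, but I don't want to simply keep the first one R finds. I want to keep the row whose Group appears most often across all the other categories. I hope my example makes this clear.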

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

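I only want to keep one row for each Name within each Tissue, Food, and Eggphase combination, but I want to choose the row whose Group appears in the most distinct egg phases (for the same Tissue and Food combination).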

# Results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried:

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but it only keeps the first entry:

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

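Any ideas on how to tell it to pick the row whose Group appears in most (if not all) of the egg phases within a Tissue/Food combination? In my example, the groups that appear most often in the wb/fl Tissue/Food combination are c and b, but Kia does not appear in Group b, so c is the better choice. My real data also has duplicates that come from groups other than the most common one; how can I make it pick the next most common Group for just those rows?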

I hope I have made enough sense.
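
To make the selection criterion concrete, here is a minimal sketch that counts, for each Tissue/Food/Group combination, how many distinct egg phases the group appears in. It assumes the `myDF` data from the question; the names `phase_counts` and `n_phases` are made up for illustration and do not appear in the original post.

```r
library(dplyr)

# Example data from the question
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# Number of distinct egg phases each Group covers within a Tissue/Food combination
# (n_phases is a hypothetical column name)
phase_counts <- myDF %>%
  group_by(Tissue, Food, Group) %>%
  summarise(n_phases = n_distinct(Eggphase), .groups = "drop") %>%
  arrange(Tissue, Food, desc(n_phases))

phase_counts
```

On this example data, c is the only group in the wb/fl combination that covers all three egg phases, while a and b each cover two, so any rule based on this count alone leaves a and b tied.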


3 Answers

Stack Overflow user

Answered 2017-11-13 23:28:14

One option is to create a frequency column grouped by Tissue, Food, and Group, then arrange by n in descending order and use distinct:

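library(dplyr)

myDF %>%
  # Count how many rows each Group has within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # Within each key, put the most frequent Group first
  arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
  # distinct() keeps the first row per key, i.e. the most frequent Group
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n)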
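
The answer above ranks groups by row count. The question asks about the number of distinct egg phases a group appears in, so a variant of the same arrange-then-distinct idea could rank on `n_distinct(Eggphase)` instead. This is only a sketch of that variation, not the answerer's code; `n_phases` and `one_bird_ranked` are illustrative names.

```r
library(dplyr)

# Example data from the question
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

one_bird_ranked <- myDF %>%
  # Rank each Group by how many distinct egg phases it covers per Tissue/Food
  group_by(Tissue, Food, Group) %>%
  mutate(n_phases = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # Best-covered Group first within each Tissue/Food/Eggphase/Name key
  arrange(Tissue, Food, Eggphase, Name, desc(n_phases)) %>%
  # Keep the first (best-ranked) row per key
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n_phases)

one_bird_ranked
```

Because only rows that actually exist for a given person and egg phase are candidates, a group the person never appears in (such as b for Kia in the after phase) cannot be selected. Betty's two yolkdep rows tie under this ranking (a and b each cover two phases), so getting b as in the desired output would need an additional tie-breaking rule that the question does not specify.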
Votes: 2

Stack Overflow user

Answered 2017-11-13 23:57:49

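I suppose this question and its answers should give me a reason to learn dplyr and the tidyverse, but since I have already put in the effort to come up with a working answer, here it is: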

myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# I usually have the following setting active: options(stringsAsFactors=F)
# The following might error without such a setting

# Flag every row whose Name/Eggphase/Tissue/Food key occurs more than once
# (duplicated() alone misses the first occurrence, so also scan from the end)
myDF$duplicate <- duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')])
myDF$duplicate <- ifelse(duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')], fromLast = TRUE), yes = TRUE, no = myDF$duplicate)

# Count the distinct egg phases each Group appears in
eggphaseCount <- with(myDF, aggregate(x = list(Group_phaseCt = Eggphase), by = list(Group = Group), FUN = function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF, eggphaseCount, by = 'Group', all = TRUE)

# Get the max # of egg phases by name
scale <- with(myDF, aggregate(x = list(PhaseMax = Group_phaseCt), by = list(Name = Name), FUN = max))
# Add to DF
myDF <- merge(myDF, scale, by = 'Name', all = TRUE)

# Take the ratio
myDF$bestRatio <- with(myDF, Group_phaseCt / PhaseMax)
# Keep only those that aren't a duplicate, or are a duplicate and have the highest ratio
myDF2 <- myDF[with(myDF, which(duplicate == FALSE | (duplicate == TRUE & bestRatio == 1))), ]
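
The two `duplicate` assignments in the code above flag every row that belongs to a repeated key, including the first occurrence. As a small sketch, the same flag can be computed in one expression by combining a forward and a backward `duplicated()` scan; `keys` and `is_dup` are illustrative names, and the data is the example from the question in its original row order.

```r
# Example data from the question
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# Columns that define one sample
keys <- myDF[, c("Name", "Eggphase", "Tissue", "Food")]

# duplicated() marks the second and later occurrences;
# fromLast = TRUE marks all but the last; OR-ing them marks every member
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Rows that share their key with at least one other row
myDF[is_dup, ]
```

On the example data this flags the two Kia/after rows, the two Lucy/after rows, and the two Betty/yolkdep rows.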
Votes: 0

Stack Overflow user

Answered 2017-11-14 20:54:40

Hey, thanks for your help! A combination of what you suggested seems to work:

library(dplyr)

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')])
# duplicated() alone does not flag the first entry of a repeated combination,
# so also check from the last row backwards
myDF$duplicate <- ifelse(duplicated(myDF[, c('Name', 'Eggphase', 'Tissue', 'Food')], fromLast = TRUE), yes = TRUE, no = myDF$duplicate)

# Count eggphases by group
eggphaseCount <- with(myDF, aggregate(x = list(Group_phaseCt = Eggphase), by = list(Group = Group), FUN = function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF, eggphaseCount, by = 'Group', all = TRUE)

# Get the max # of egg phases by name
scale <- with(myDF, aggregate(x = list(PhaseMax = Group_phaseCt), by = list(Name = Name), FUN = max))
# Add to DF
myDF <- merge(myDF, scale, by = 'Name', all = TRUE)

# Take the ratio
myDF$bestRatio <- with(myDF, Group_phaseCt / PhaseMax)

# Make a new df without duplicates
myDF2 <- myDF %>%
  # Arrange so that, within each key, the first row is from the Group with the most egg phases
  # and the Name that appears in the most egg phases; Group comes last so it only breaks ties
  # (sorting on Group before the counts would make the desc() terms have no effect)
  arrange(Tissue, Food, Eggphase, Name, desc(Group_phaseCt), desc(PhaseMax), Group) %>%
  # Select only distinct rows according to the specified vars, keeping all other columns
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
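
Whichever approach is used, it can help to confirm that the result really has one row per Tissue/Food/Eggphase/Name key. The sketch below uses a hypothetical helper, `has_unique_keys()`, which is not part of dplyr or of the original answers, and demonstrates it on the question's data with a plain `distinct()` call.

```r
library(dplyr)

# Example data from the question
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

key_cols <- c("Tissue", "Food", "Eggphase", "Name")

# Hypothetical helper: TRUE when no combination of the key columns repeats
has_unique_keys <- function(df, keys) {
  anyDuplicated(df[, keys]) == 0
}

# The raw data contains repeated keys
has_unique_keys(myDF, key_cols)

# After deduplication each key appears exactly once
deduped <- myDF %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
has_unique_keys(deduped, key_cols)
```

This check only confirms uniqueness of the keys; it does not say whether the retained Group is the one that was intended.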
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/47267725
