Q: Removing duplicates from a DataFrame in R

Stack Overflow user

Asked 2017-10-26 16:40:26

Answers: 3, Views: 511, Followers: 0, Votes: 0

I have this data:

UserID   Quiz_answers            Quiz_Date       
  1     `a1,a2,a3`Positive       26-01-2017        
  1     `a1,a4,a3`Positive       26-01-2017        
  1     `a1,a2,a4`Negative       28-02-2017        
  1     `a1,a2,a3`Neutral        30-10-2017        
  1     `a1,a2,a4`Positive       30-11-2017        
  1     `a1,a2,a4`Negative       28-02-2018    

  2     `a1,a2,a3`Negative       27-01-2017            
  2     `a1,a7,a3`Neutral        28-08-2017        
  2     `a1,a2,a5`Negative       28-01-2017

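I want to remove the duplicate rows.

The rules for a duplicate row are as follows:

1. The word that appears after the last backtick (`) in the Quiz_answers column is the same.
2. For such rows, if the UserID and Quiz_Date column values are also the same, the row is a duplicate.

The data can be created like this:

    UserID <- c(1,1,1,1,1,1,2,2,2)
    Quiz_answers <- c("`a1,a2,a3`Positive","`a1,a4,a3`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Neutral","`a1,a2,a4`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Negative","`a1,a7,a3`Neutral","`a1,a2,a5`Negative")
    Quiz_Date <- as.Date(c("26-01-2017","26-01-2017","28-02-2017","30-10-2017","30-11-2017","28-02-2018","27-01-2017","28-08-2017","28-01-2017"), '%d-%m-%Y')
    data <- data.frame(UserID, Quiz_answers, Quiz_Date)

I wrote the following code, which leaves me with an empty data frame: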

   data.removeDuplicates<- function(frames)
    {   
         apply(frames[ ,c(grep("UserID", colnames(data)),grep("Quiz_answers", colnames(data)),grep("Quiz_Date", colnames(data)))],1,function(slice){     
             Outcome<-paste0("`",tail(strsplit(slice[2],split="`")[[1]],1))      
             cat("\n\n Searching for records: ",slice[1],Outcome,slice[3])
            data<<-data[!( data$UserID == slice[1] &  paste0("`",sapply(strsplit(as.character(data[,2]),'`'), tail, 1)) == c(Outcome) & data[,3]==c(slice[3])), ]   
        })      
        print(frames)
    }
    data.removeDuplicates(data)
    print(data)
    [1] UserID       Quiz_answers Quiz_Date   
    <0 rows> (or 0-length row.names)

I am expecting this output:

UserID   Quiz_answers            Quiz_Date       
  1     `a1,a2,a3`Positive       26-01-2017        
  1     `a1,a2,a4`Negative       28-02-2017        
  1     `a1,a2,a3`Neutral        30-10-2017        
  1     `a1,a2,a4`Positive       30-11-2017        
  1     `a1,a2,a4`Negative       28-02-2018    

  2     `a1,a2,a3`Negative       27-01-2017            
  2     `a1,a7,a3`Neutral        28-08-2017        
  2     `a1,a2,a5`Negative       28-01-2017

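According to the rules, only the second row should be removed from the DataFrame, because it is the only row that meets the duplicate condition. What am I doing wrong?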

dataframe

duplicates

apply

3 Answers

Stack Overflow user

Accepted answer

Posted 2017-10-26 16:50:41

Try this.

Your data:

df <- read.table(text="UserID   Quiz_answers            Quiz_Date       
1     `a1,a2,a3`Positive       26-01-2017        
1     `a1,a4,a3`Positive       26-01-2017        
1     `a1,a2,a4`Negative       28-02-2017        
1     `a1,a2,a3`Neutral        30-10-2017        
1     `a1,a2,a4`Positive       30-11-2017        
1     `a1,a2,a4`Negative       28-02-2018    
2     `a1,a2,a3`Negative       27-01-2017            
2     `a1,a7,a3`Neutral        28-08-2017        
2     `a1,a2,a5`Negative       28-01-2017", header = TRUE, stringsAsFactors=FALSE)

Solution and output:

library(dplyr)
ans <- df %>%
        mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
        group_by(grp, UserID, Quiz_Date) %>%
        slice(1) %>%
        ungroup() %>%
        select(-grp) %>%
        arrange(UserID, Quiz_Date)

# A tibble: 8 x 3
  # UserID       Quiz_answers  Quiz_Date
   # <int>              <chr>      <chr>
# 1      1 `a1,a2,a3`Positive 26-01-2017
# 2      1 `a1,a2,a4`Negative 28-02-2017
# 3      1 `a1,a2,a4`Negative 28-02-2018
# 4      1  `a1,a2,a3`Neutral 30-10-2017
# 5      1 `a1,a2,a4`Positive 30-11-2017
# 6      2 `a1,a2,a3`Negative 27-01-2017
# 7      2 `a1,a2,a5`Negative 28-01-2017
# 8      2  `a1,a7,a3`Neutral 28-08-2017
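
As for why the function in the question ends up with zero rows, a likely explanation is that the loop visits every row and deletes all rows that share that row's key. Each row always matches itself, so nothing survives. The sketch below rebuilds the question's data from its table and contrasts that behaviour with `duplicated()`, which flags only the second and later occurrences of a key. The names `outcome`, `key`, `emptied`, and `deduped` are illustrative and do not appear in the original posts.

```r
# Data rebuilt from the table in the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c(
    "`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"
  ),
  Quiz_Date = as.Date(
    c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017",
      "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"),
    format = "%d-%m-%Y"
  ),
  stringsAsFactors = FALSE
)

# Outcome word: everything after the last backtick.
outcome <- sub(".*`", "", data$Quiz_answers)

# Composite key built from the three values the rules compare.
key <- paste(data$UserID, outcome, data$Quiz_Date)

# Roughly what the loop does: drop every row whose key matches the key of
# any visited row. Every row matches its own key, so all rows are dropped.
emptied <- data[!(key %in% key), ]
nrow(emptied)  # 0

# duplicated() marks only repeats, so the first row of each key is kept.
deduped <- data[!duplicated(key), ]
nrow(deduped)  # 8
```

Only the second row shares its key with an earlier row, so it is the only one removed, which matches the expected output in the question.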

Votes: 1

Stack Overflow user

Posted 2017-10-26 17:11:16

You can use the sqldf package as shown below. First, select the Positive, Negative, and Neutral groups. Then use GROUP BY to filter out the duplicates.

require("sqldf")
result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date 
       UNION 
       SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date 
       UNION 
       SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")

After running it, result is:

  UserID       Quiz_answers  Quiz_Date
1      1  `a1,a2,a3`Neutral 30-10-2017
2      1 `a1,a2,a4`Negative 28-02-2017
3      1 `a1,a2,a4`Negative 28-02-2018
4      1 `a1,a2,a4`Positive 30-11-2017
5      1 `a1,a4,a3`Positive 26-01-2017
6      2 `a1,a2,a3`Negative 27-01-2017
7      2 `a1,a2,a5`Negative 28-01-2017
8      2  `a1,a7,a3`Neutral 28-08-2017
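
Note that this result keeps `a1,a4,a3`Positive for user 1 on 26-01-2017, whereas the expected output in the question keeps `a1,a2,a3`Positive. As far as I know, SQLite does not guarantee which row of a group supplies the non-grouped columns in a query like this. If the first row of each group must be kept, one option is to make that choice explicit. The following is a minimal base R sketch, not the answer's method; it assumes `df` holds the dates as character strings, as in the `read.table` call of the accepted answer, and the names `grp`, `pos_in_group`, and `first_rows` are hypothetical.

```r
# Same data as the question, with dates kept as character strings.
df <- data.frame(
  UserID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Quiz_answers = c(
    "`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"
  ),
  Quiz_Date = c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                "28-01-2017"),
  stringsAsFactors = FALSE
)

# Outcome word after the last backtick (Positive, Negative, or Neutral).
grp <- sub(".*`", "", df$Quiz_answers)

# Number the rows within each (UserID, outcome, date) group in input order.
pos_in_group <- ave(seq_len(nrow(df)), df$UserID, grp, df$Quiz_Date,
                    FUN = seq_along)

# Keep only the first row of every group.
first_rows <- df[pos_in_group == 1, ]
```

Because the rows are numbered in their original order, the row that survives is always the one that appears first in `df`.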

Votes: 0

Stack Overflow user

Posted 2017-10-26 17:39:03

Here is a two-line solution that uses only base R:

data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))

data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers")   ]))), !(names(data) %in% "group")]
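
The second line above converts row names to integers and uses them as positions, which assumes the row names are still 1 to n in order. That holds for the question's data but not, for example, after the frame has been sorted or filtered. A sketch of a variant that avoids row names by using `duplicated()` follows; it is an alternative under that assumption, not a correction of the answer, and the names `reordered`, `is_dup`, and `result` are illustrative.

```r
# Data rebuilt from the table in the question.
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c(
    "`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
    "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"
  ),
  Quiz_Date = as.Date(
    c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017",
      "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"),
    format = "%d-%m-%Y"
  ),
  stringsAsFactors = FALSE
)

# Simulate a frame whose row names are no longer 1..n in order.
reordered <- data[order(data$Quiz_Date), ]

# Grouping label: the text after the last backtick.
reordered$group <- sub(".*`", "", reordered$Quiz_answers)

# duplicated() returns a logical vector aligned with the rows, so it can be
# used for subsetting directly, whatever the row names are.
is_dup <- duplicated(reordered[, c("UserID", "group", "Quiz_Date")])
result <- reordered[!is_dup, c("UserID", "Quiz_answers", "Quiz_Date")]

nrow(result)  # 8
```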

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/46959590
