
Replace NA with the most frequent value of a column, grouped by another column

Stack Overflow user
Asked 2018-06-03 00:46:24
1 answer, 61 views, 0 followers, 1 vote

I have a data frame named df that looks like this:

   Author_ID Country Cited Name  Title
1: 1         Spain   10    Alex  Whatever
2: 1         France  15    Ale   Whatever2
3: 1         NA      10    Alex  Whatever3
4: 1         Spain   10    Alex  Whatever4
5: 2         Italy   10    Alice Whatever5
6: 2         Greece  10    Alice Whatever6
7: 2         Greece  10    Alice Whatever7
8: 2         NA      10    Alce  Whatever8
9: 2         NA      10    Alce  Whatever8

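I would like to get the following result, where each NA is replaced by the country that occurs most often for that Author_ID (if two countries occur equally often, a random pick between the two would be fine):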

   Author_ID Country Cited Name  Title
1: 1         Spain   10    Alex  Whatever
2: 1         France  15    Ale   Whatever2
3: 1         Spain   10    Alex  Whatever3
4: 1         Spain   10    Alex  Whatever4
5: 2         Italy   10    Alice Whatever5
6: 2         Greece  10    Alice Whatever6
7: 2         Greece  10    Alice Whatever7
8: 2         Greece  10    Alce  Whatever8
9: 2         Greece  10    Alce  Whatever8

Thanks in advance.

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-03 01:00:53

Using data.table

library(data.table)
# setDT(df)  # uncomment if df is still a plain data.frame (converts it by reference)
df[,Country := replace(Country,is.na(Country),names(which.max(table(Country)))),by=Author_ID]

#    Author_ID Country Cited  Name     Title
# 1:         1   Spain    10  Alex  Whatever
# 2:         1  France    15   Ale Whatever2
# 3:         1   Spain    10  Alex Whatever3
# 4:         1   Spain    10  Alex Whatever4
# 5:         2   Italy    10 Alice Whatever5
# 6:         2  Greece    10 Alice Whatever6
# 7:         2  Greece    10 Alice Whatever7
# 8:         2  Greece    10  Alce Whatever8
# 9:         2  Greece    10  Alce Whatever8
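
All three solutions in this answer rely on the same expression, `names(which.max(table(x)))`. As a minimal sketch of what it does on a single group (the vector `x` below is just Author_ID 1's countries copied from the example data; the names `counts`, `top` and `filled` are only for illustration):

```r
# Countries for one author, taken from the example data (Author_ID 1)
x <- c("Spain", "France", NA, "Spain")

counts <- table(x)                # NA is dropped by default: France 1, Spain 2
top <- names(which.max(counts))   # name of the first entry with the highest count
filled <- replace(x, is.na(x), top)

print(filled)
```

Because `table()` ignores NA by default, the missing values never compete with the real countries when the most frequent one is chosen.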

In base R

df$Country <- unlist(tapply(df$Country,df$Author_ID,function(x)
  replace(x,is.na(x),names(which.max(table(x))))))
# Author_ID Country Cited  Name     Title
# 1         1   Spain    10  Alex  Whatever
# 2         1  France    15   Ale Whatever2
# 3         1   Spain    10  Alex Whatever3
# 4         1   Spain    10  Alex Whatever4
# 5         2   Italy    10 Alice Whatever5
# 6         2  Greece    10 Alice Whatever6
# 7         2  Greece    10 Alice Whatever7
# 8         2  Greece    10  Alce Whatever8
# 9         2  Greece    10  Alce Whatever8
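
One caveat with the `tapply()` version: `tapply()` returns its results group by group, so `unlist()` only lines up with the original rows when the data is already sorted by Author_ID, as it is in the example. A possible order-independent variant, sketched here with base R's `ave()` (the helper name `fill_mode` and the small shuffled data frame are assumptions made for the illustration, not part of the original answer):

```r
# Small data frame whose rows are NOT sorted by Author_ID
df <- data.frame(
  Author_ID = c(2, 1, 2, 1, 2, 1),
  Country   = c("Greece", "Spain", NA, NA, "Greece", "Spain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA with the most frequent non-NA value
fill_mode <- function(x) replace(x, is.na(x), names(which.max(table(x))))

# ave() applies fill_mode per group and writes the results back
# into the original row positions
df$Country <- ave(df$Country, df$Author_ID, FUN = fill_mode)

print(df)
```

Like the original, this sketch assumes every group has at least one non-NA country.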

Using dplyr

library(dplyr)
df %>% group_by(Author_ID) %>%
  mutate(Country = replace(
    Country,
    is.na(Country),
    names(which.max(table(Country)))))

# # A tibble: 9 x 5
# # Groups:   Author_ID [2]
# Author_ID Country Cited  Name     Title
# <int>   <chr> <int> <chr>     <chr>
# 1         1   Spain    10  Alex  Whatever
# 2         1  France    15   Ale Whatever2
# 3         1   Spain    10  Alex Whatever3
# 4         1   Spain    10  Alex Whatever4
# 5         2   Italy    10 Alice Whatever5
# 6         2  Greece    10 Alice Whatever6
# 7         2  Greece    10 Alice Whatever7
# 8         2  Greece    10  Alce Whatever8
# 9         2  Greece    10  Alce Whatever8

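If several countries are tied for the most occurrences, this takes the first one (in the order used by table(), which sorts the values) rather than a random one.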
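
If a random pick among tied countries is wanted, as the question suggests, one possible approach is to collect all values that share the maximum count and sample one of them. This is only a sketch; `fill_mode_random` is a hypothetical helper name, not something from the original answer:

```r
# Hypothetical helper: fill NA with a randomly chosen most frequent value
fill_mode_random <- function(x) {
  counts <- table(x)
  # A group that contains only NA has nothing to fill with
  if (length(counts) == 0) return(x)
  # All values that share the highest count
  tied <- names(counts)[as.vector(counts) == max(counts)]
  # Pick one of them at random (sample.int avoids surprises with length-1 input)
  pick <- tied[sample.int(length(tied), 1)]
  replace(x, is.na(x), pick)
}

set.seed(1)  # only to make the illustration reproducible
x <- c("Italy", "Greece", NA)
filled <- fill_mode_random(x)
print(filled)
```

The helper can be dropped into any of the three solutions in place of the `replace(...)` call, for example `df[, Country := fill_mode_random(Country), by = Author_ID]` with data.table.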

If Country is NA for every row of some authors

First run the following code to modify the sample data:

df$Country[df$Author_ID ==2] <- NA

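Then here are the three adapted solutions. They are not very elegant, but they work. I suspect there may be a base/dplyr/data.table function that converts a zero-length element to NA more smoothly.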

setDT(df)
df[,Country := replace(Country,is.na(Country),{
  nm <- names(which.max(table(Country)))  # the column is Country; there is no x in this scope
  if(length(nm)==0) NA else nm}),
  by=Author_ID]
df <- df[!is.na(df$Country),]

#    Author_ID Country Cited Name     Title
# 1:         1   Spain    10 Alex  Whatever
# 2:         1  France    15  Ale Whatever2
# 3:         1   Spain    10 Alex Whatever3
# 4:         1   Spain    10 Alex Whatever4

df$Country <- unlist(tapply(df$Country,df$Author_ID,function(x)
  replace(x,is.na(x),{
    nm <- names(which.max(table(x)))
    if(length(nm)==0) NA else nm
    })))
df <- df[!is.na(df$Country),]

# Author_ID Country Cited Name     Title
# 1         1   Spain    10 Alex  Whatever
# 2         1  France    15  Ale Whatever2
# 3         1   Spain    10 Alex Whatever3
# 4         1   Spain    10 Alex Whatever4

df %>% group_by(Author_ID) %>%
  mutate(Country = replace(
    Country,
    is.na(Country),
    names(which.max(table(Country))) %>%
      {if(length(.)==0) NA else .})) %>%
  filter(!is.na(Country))

# # A tibble: 4 x 5
# # Groups:   Author_ID [1]
# Author_ID Country Cited Name  Title    
# <int> <chr>   <int> <chr> <chr>    
# 1         1 Spain      10 Alex  Whatever 
# 2         1 France     15 Ale   Whatever2
# 3         1 Spain      10 Alex  Whatever3
# 4         1 Spain      10 Alex  Whatever4
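
The `length(nm) == 0` check in the three adapted solutions is needed because `table()` of an all-NA vector is empty, so `names(which.max(...))` has length zero and `replace()` would have nothing to assign. One way to avoid repeating that check is to wrap it in a small helper. The sketch below does this with base R only; `mode_or_na` and `fill_mode` are hypothetical names, and the tiny data frame is made up for the illustration:

```r
# Hypothetical helper: most frequent non-NA value, or NA if there is none
mode_or_na <- function(x) {
  nm <- names(which.max(table(x)))
  if (length(nm) == 0) NA_character_ else nm
}

# Hypothetical helper: fill NA with the group's most frequent value
fill_mode <- function(x) replace(x, is.na(x), mode_or_na(x))

df <- data.frame(
  Author_ID = c(1, 1, 1, 2, 2),
  Country   = c("Spain", NA, "Spain", NA, NA),
  stringsAsFactors = FALSE
)

df$Country <- ave(df$Country, df$Author_ID, FUN = fill_mode)
df <- df[!is.na(df$Country), ]  # drop authors that have no known country at all

print(df)
```

The same helper should also work inside the data.table and dplyr versions, for example `mutate(Country = fill_mode(Country))` after `group_by(Author_ID)`.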

Data

df <- read.table(text="Author_ID Country Cited Name  Title
1        Spain   10    Alex  Whatever
1        France  15    Ale   Whatever2
1        NA      10    Alex  Whatever3
1        Spain   10    Alex  Whatever4
2        Italy   10    Alice Whatever5
2        Greece  10    Alice Whatever6
2        Greece  10    Alice Whatever7
2        NA      10    Alce  Whatever8
2        NA      10    Alce  Whatever8", header = TRUE, stringsAsFactors = FALSE)
Votes: 4

Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50659096
