This seems like it should be a fairly simple problem, but I can't seem to find a simple solution.
I have a character vector like this:
my_info <- c("Fruits",
"North America",
"Apples",
"Michigan",
"Europe",
"Pomegranates",
"Greece",
"Oranges",
"Italy",
"Vegetables",
"North America",
"Potatoes",
"Idaho",
"Avocados",
"California",
"Europe",
"Artichokes",
"Italy",
"Meats",
"North America",
"Beef",
"Illinois")
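I want to parse this character vector into a data frame with one row per food and columns for the food type, region, food and location.
The lists of food types and regions will always stay the same, but the foods and their locations may change.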
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
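I was thinking I need to use something like str_split, but with food_type and region as some kind of delimiter? I'm not sure how to proceed. The character vector does have an order.
Thanks.
Posted on 2018-06-06 02:07:50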
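One solution is to first convert the my_info vector into a matrix using ncol = 4. This splits your vector into the rows of a matrix/data frame.
Now you can apply the rules for food_type and region, and swap any food_type or region value that ended up in another column.
Note: I have asked the OP to double-check the data, because the elements do not seem to form complete rows of four as the OP's description implies.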
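The note above can be checked directly. This is a minimal sketch that only counts the elements of the sample vector from the question; the names n_elements and remainder are introduced here for illustration.

```r
# Sample vector from the question, repeated so the block runs on its own
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")

n_elements <- length(my_info)  # number of entries in the vector
remainder  <- n_elements %% 4  # non-zero means the vector cannot fill rows of 4

print(n_elements)
print(remainder)
```

Because the remainder is not zero, matrix(my_info, ncol = 4, byrow = TRUE) issues a warning and recycles the start of the vector to fill the last row.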
df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))
names(df) <- c("Foodtype", "Region", "Food", "Location")
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
t(apply(df, 1, function(x) {
  for (i in seq_along(x)) {
    # One can think of writing a swap function here.
    if (x[i] %in% region) {
      temp = x[i]
      x[i] = x[2]
      x[2] = temp
    }
    # Swap any food_type wrongly placed in other column
    if (x[i] %in% food_type) {
      temp = x[i]
      x[i] = x[1]
      x[1] = temp
    }
  }
  x
}))
# Foodtype Region Food Location
# [1,] "Fruits" "North America" "Apples" "Michigan"
# [2,] "Pomegranates" "Europe" "Greece" "Oranges"
# [3,] "Vegetables" "North America" "Italy" "Potatoes"
# [4,] "Idaho" "Europe" "California" "Avocados"
# [5,] "Meats" "North America" "Artichokes" "Italy"
# [6,] "Fruits" "North America" "Beef" "Illinois"
#
Posted on 2018-06-06 03:51:55
I have a rather long solution, but it should work as long as the food and location always come in the same order.
First, create some data.frames with dplyr.
library(dplyr)
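# tibble() replaces the deprecated data_frame()
info <- tibble(my_info = my_info)
# Lookup tables; note that these overwrite the original region and food_type vectors
region <- tibble(region_id = region, region = region)
food_type <- tibble(food_type_id = food_type, food_type = food_type)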
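Next, create a data.frame that joins all of these together, use tidyr to fill in the missing values, and remove the rows we don't need. The most important trick is the last step, which creates a cols column based on the assumption that the order is always the same!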
library(tidyr)
df <- info %>%
  left_join(food_type, by = c("my_info" = "food_type_id")) %>%
  left_join(region, by = c("my_info" = "region_id")) %>%
  fill(food_type) %>%
  group_by(food_type) %>%
  fill(region) %>%
  filter(!is.na(region) & !(my_info == region)) %>%
  ungroup() %>%
  mutate(cols = rep(c("food", "location"), group_size(.)/2))
This returns:
# A tibble: 14 x 4
my_info food_type region cols
<chr> <chr> <chr> <chr>
1 Apples Fruits North America food
2 Michigan Fruits North America location
3 Pomegranates Fruits Europe food
4 Greece Fruits Europe location
5 Oranges Fruits Europe food
6 Italy Fruits Europe location
7 Beef Meats North America food
8 Illinois Meats North America location
9 Potatoes Vegetables North America food
10 Idaho Vegetables North America location
11 Avocados Vegetables North America food
12 California Vegetables North America location
13 Artichokes Vegetables Europe food
14 Italy Vegetables Europe location
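The cols trick relies only on rep() alternating two labels over an even number of rows. A small base R sketch on the first four remaining values (the names values and cols_demo are made up for this illustration):

```r
# Values left after the type and region rows are filtered out:
# they alternate food, location, food, location, ...
values <- c("Apples", "Michigan", "Pomegranates", "Greece")

# Repeat the label pair once per food/location pair
cols_demo <- rep(c("food", "location"), length(values) / 2)

print(data.frame(my_info = values, cols = cols_demo))
```

If a food were ever listed without a location, the labels would drift out of step, which is why the answer stresses the ordering assumption.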
Next, use tidyr to spread cols into food and location columns.
df <- df %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>%
  spread(cols, my_info) %>%
  select(-ind)
# A tibble: 7 x 4
# Groups: food_type, region [5]
food_type region food location
<chr> <chr> <chr> <chr>
1 Fruits Europe Pomegranates Greece
2 Fruits Europe Oranges Italy
3 Fruits North America Apples Michigan
4 Meats North America Beef Illinois
5 Vegetables Europe Artichokes Italy
6 Vegetables North America Potatoes Idaho
7 Vegetables North America Avocados California
This can all be done in one go by simply dropping the intermediate steps that create the data.frames.
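A sketch of what the single pipeline might look like, assuming current versions of dplyr and tidyr. It uses tibble() in place of data_frame() and n() in place of group_size(.), and it redefines the sample vectors so the block runs on its own; the name result is introduced here.

```r
library(dplyr)
library(tidyr)

my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

result <- tibble(my_info = my_info) %>%
  # Tag rows that are a food type or a region via lookup tables built inline
  left_join(tibble(food_type_id = food_type, food_type = food_type),
            by = c("my_info" = "food_type_id")) %>%
  left_join(tibble(region_id = region, region = region),
            by = c("my_info" = "region_id")) %>%
  fill(food_type) %>%            # carry the food type down
  group_by(food_type) %>%
  fill(region) %>%               # carry the region down within each type
  ungroup() %>%
  filter(!is.na(region), my_info != region) %>%  # drop type and region rows
  mutate(cols = rep(c("food", "location"), n() / 2)) %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>% # pair index so spread() keeps rows unique
  ungroup() %>%
  spread(cols, my_info) %>%
  select(-ind)

print(result)
```

Here the original food_type and region vectors are left untouched, so they can still be used with %in% afterwards.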
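Posted on 2018-06-06 04:54:45
Here are three alternatives. They all use na.locf0 from zoo and the cn vector, which is only shown being computed in the first one.
1) Let cn be a vector of the same length as my_info that identifies which output column number each element of my_info belongs to. Let cdef be the output column definition vector 1:4, with the output column names as its names. Then, for each output column, create a vector of the same length as my_info that holds the elements belonging to that column and NA for all other elements. Finally, use na.locf0 to fill in the NA values and take the elements at the positions that correspond to column 4.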
library(zoo)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])
giving:
> m
food_type region food location
[1,] "Fruits" "North America" "Apples" "Michigan"
[2,] "Fruits" "Europe" "Pomegranates" "Greece"
[3,] "Fruits" "Europe" "Oranges" "Italy"
[4,] "Vegetables" "North America" "Potatoes" "Idaho"
[5,] "Vegetables" "North America" "Avocados" "California"
[6,] "Vegetables" "Europe" "Artichokes" "Italy"
[7,] "Meats" "North America" "Beef" "Illinois"
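To make the role of cn concrete, here is a base R sketch on the sample data. The helper locf is a hypothetical stand-in written for this illustration, not part of zoo; it is only assumed to behave like na.locf0 on a plain vector (carry the last non-NA value forward and leave leading NAs in place).

```r
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

# 1 = food type, 2 = region; everything else alternates 3 (food), 4 (location)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
print(cn)

# Hypothetical helper: last observation carried forward in base R
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]   # idx 0 maps to NA, so leading NAs remain
}

# Region column: keep only region entries, fill down, read at location rows
region_col <- locf(ifelse(cn == 2, my_info, NA))[cn == 4]
print(region_col)
```

Reading the filled vector at the positions where cn == 4 yields one value per output row, which is why every column in m has seven entries.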
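We created a character matrix as output because the output is entirely character, but if you want a data frame anyway, use:
as.data.frame(m, stringsAsFactors = FALSE)
2) Alternatively, we can create m from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix of NAs, using na.locf to fill in the NAs, and then taking the rows that correspond to column 4.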
n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")
identical(m2, m) # test
## [1] TRUE
3) A third alternative for creating m from cn is to construct the matrix column by column, as follows:
m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3],
region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3],
food = my_info[cn == 3],
location = my_info[cn == 4])
identical(m, m3) # test
## [1] TRUE
https://stackoverflow.com/questions/50706041