
Q: Converting multiple unevenly nested lists into a data frame in R

Stack Overflow user

Asked 2015-08-29 10:44:04

1 answer · 2K views · 0 followers · 1 vote

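I'm trying to get to grips with R, and as an experiment I thought I would play around with some cricket data. In its rawest format it is a yaml file, which I turned into an R object using the yaml R package.

However, I now have a number of nested lists of uneven length that I want to turn into a data frame in R. I have tried a few methods, such as writing loops to parse the data and using some of the functions in the tidyr package, but I can't get any of them to work nicely.

Does anyone know the best way to tackle this? Replicating the data structure here would be difficult, because the complexity comes from the multiple nested lists and the unevenness of their lengths (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I am using the ODI data).

Thanks in advance!

Update: here is what I have tried.

1) Using tidyr's separate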

d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")

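This uses tidyr's unnest, which returns a single-column data frame that I then try to separate. The problem is that, partway through, extra variables appear in some rows but not in others, which throws the separation off.

2) Building vectors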

ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector

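The problem here is that the resulting vectors are of uneven length, so they cannot be combined into a data frame. This approach also runs into the sporadic extra variables in the data set (as described above).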

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-08-29 14:49:13

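Caveat: this is not a complete answer, just an attempt at arranging the innings data.

plyr::rbind.fill may offer a solution for binding rows that have different numbers of columns.

I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this over all the yaml files in the directory.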

# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp <- tempfile(), mode = "wb") # binary mode so the zip is not corrupted on Windows
tmp <- unzip(temp)

# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)

# Function to process list into dataframe
p_fun <- function(X) {
          team = X[[1]][["team"]]

          # function to process each list subelement that represents each throw
          fn <- function(...) {
                    tmp = unlist(...)
                    tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
                    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
                    tmp
                    }
           # loop over all throws
           lst = lapply(X[[1]][["deliveries"]], fn )

           cbind(team, plyr::rbind.fill(lst))
          }

# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))

Some explanation of the list structure and how to subset it

To get an idea of the structure of the list, use

str(raw_dat) # but this gives a really long list of data

You can truncate the output to make it a bit more useful.

str(raw_dat, 3)
length(raw_dat)

So there are three main list elements: meta, info and innings. You can also see this with

names(raw_dat)

To access the metadata, you can use

raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements, e.g. [["data_version"]]
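
If you want to try these access patterns without downloading anything, here is a minimal sketch on a made-up list. The object `toy` and its values are my own stand-ins, not taken from the cricsheet files; only the shape (a named list of named lists) is meant to resemble raw_dat.

```r
# Toy stand-in for the parsed yaml object (all values are made up)
toy <- list(
  meta = list(data_version = 0.6, revision = 1),
  info = list(city = "Somewhere")
)

by_dollar   <- toy$meta$data_version             # access by name with $
by_position <- toy[[1]][[1]]                     # access by position
by_name     <- toy[["meta"]][["data_version"]]   # access by quoted name

identical(by_dollar, by_position)  # TRUE
identical(by_dollar, by_name)      # TRUE

# Single brackets keep the list wrapper rather than extracting the element
wrapped <- toy["meta"]
class(wrapped)                     # "list"
```

Note that the name inside `[[ ]]` has to be a quoted string; with backticks R would look for a variable of that name instead.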

The main data is in the innings element.

str(raw_dat$innings, 3)

Look at the names within the list elements

lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)

There are two list elements, each with sub-elements. You can access these with

raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]

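The function above parses the deliveries data in raw_dat$innings. To see what it does, work from the inside out.

Use one record to see how it works. Note that the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]], so it is the outer loop, while the lapply with fn loops over the deliveries within one innings, so it is the inner loop.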
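
The outer/inner loop pattern can be sketched on a small made-up list. Everything in `toy_innings` (team names, player labels) is hypothetical; only the nesting is meant to mirror raw_dat$innings, and the inner function here just collects the delivery id rather than doing the full parsing that fn does.

```r
# Toy innings list shaped like raw_dat$innings (all values are made up)
toy_innings <- list(
  list(`1st innings` = list(
    team = "Team A",
    deliveries = list(
      list(`0.1` = list(batsman = "P1", bowler = "P9")),
      list(`0.2` = list(batsman = "P1", bowler = "P9"))
    )
  )),
  list(`2nd innings` = list(
    team = "Team B",
    deliveries = list(
      list(`0.1` = list(batsman = "P5", bowler = "P3"))
    )
  ))
)

# Outer loop: one call per innings
per_innings <- lapply(toy_innings, function(X) {
  team <- X[[1]][["team"]]
  # Inner loop: one call per delivery; here we only pick up the delivery id
  ids <- unlist(lapply(X[[1]][["deliveries"]], function(d) names(d)[1]))
  data.frame(team = team, ball = ids, stringsAsFactors = FALSE)
})

# Columns match across innings in this toy case, so plain rbind is enough
toy_df <- do.call(rbind, per_innings)
toy_df
```

In the real data the columns differ between deliveries, which is why the answer's code binds with rbind.fill instead of rbind.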

X <- raw_dat$innings[[1]] 
tmp <- X[[1]][["deliveries"]][[1]]
tmp

#create a named vector
tmp <- unlist(tmp)
tmp
#      0.1.batsman       0.1.bowler  0.1.non_striker 0.1.runs.batsman  0.1.runs.extras   0.1.runs.total 
#        "IR Bell"       "DW Steyn"       "MJ Prior"              "0"              "0"              "0"
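
To see why the flattened vectors end up with different lengths, here is a small sketch with two made-up deliveries, one of which carries an extra wicket element. The element names inside `wicket` (kind, player_out) are assumptions chosen for illustration.

```r
# Two toy deliveries (made-up values); the second has an extra wicket element
plain  <- list(`0.1` = list(batsman = "P1", bowler = "P9",
                            runs = list(batsman = 0, total = 0)))
wicket <- list(`0.2` = list(batsman = "P1", bowler = "P9",
                            runs = list(batsman = 0, total = 0),
                            wicket = list(kind = "bowled", player_out = "P1")))

v_plain  <- unlist(plain)
v_wicket <- unlist(wicket)

names(v_plain)    # nested names are joined with "."
length(v_plain)   # 4
length(v_wicket)  # 6

# Fields present only in the wicket delivery, once the id prefix is removed
extra <- setdiff(sub("^0\\.2\\.", "", names(v_wicket)),
                 sub("^0\\.1\\.", "", names(v_plain)))
extra
```

Because unlist has to return a single atomic vector, the numeric runs are coerced to character alongside the player names.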

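To use rbind.fill, the elements being bound together need to be data.frames. We also want to remove the leading numbers (the delivery identifier) from the names; otherwise every delivery would produce its own set of uniquely named columns.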

# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp)) 

# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
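
The two ideas mentioned in the comments above (splitting the number into over and delivery, and removing the leading dots) could be sketched with anchored patterns like the ones below. This is one possible approach rather than what the answer's code does, and it assumes the names always start with an "<over>.<ball>." prefix, as in the output shown earlier. It is applied to the names before data.frame() adds its X prefix.

```r
nm <- c("0.1.batsman", "0.1.bowler", "0.1.non_striker", "0.1.runs.batsman")

# Drop the leading "<over>.<ball>." prefix and keep the rest intact,
# so no stray dots are left at the front of the name
clean <- sub("^[0-9]+\\.[0-9]+\\.", "", nm)
clean

# Pull the "<over>.<ball>" identifier out of the first name and split it
id    <- sub("^([0-9]+)\\.([0-9]+)\\..*$", "\\1.\\2", nm[1])
parts <- as.integer(strsplit(id, ".", fixed = TRUE)[[1]])
over  <- parts[1]
ball  <- parts[2]
```

Anchoring with `^` means digits elsewhere in a name are left alone, which the blanket `[0-9]` pattern does not guarantee.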

So, for the first delivery of the first innings, we have

tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   01    IR Bell  DW Steyn       MJ Prior               0              0             0

To see how the lapply works, use the first three deliveries (you will need to have run the function fn in your workspace)

lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   01    IR Bell  DW Steyn       MJ Prior               0              0             0
# 
# [[2]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   02    IR Bell  DW Steyn       MJ Prior               0              0             0
# 
# [[3]]
#   ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1   03    IR Bell  DW Steyn       MJ Prior               3              0             3

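So we end up with one list element for each delivery in an innings. We then use rbind.fill to combine them into a single data.frame.

If I were going to parse every yaml file, I would use a loop.

Using the first three records as an example, and also adding the match date: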

tmp <- unzip(temp)[2:4]

all_raw_dat <- vector("list", length=length(tmp))

for(i in seq_along(tmp)) {
      d = yaml.load_file(tmp[i])
      all_raw_dat[[i]] <- cbind(date=d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}

Then use rbind.fill to combine the per-match data frames in all_raw_dat.

Q1, from the comments

A small example of using rbind.fill

a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)

rbind(a,b) # error as names dont match
plyr::rbind.fill(a, b)

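rbind.fill does not go back and add or update the extra columns in the inputs (a still has no column z). Think of it as creating an empty data frame whose columns are the unique column names across the list of data frames, unique(c(names(a), names(b))). Each row is then filled with values where they exist and is otherwise set to missing (NA).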
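
That description can be sketched in base R as follows. To be clear, `rbind_fill_sketch` is a hypothetical helper written only to illustrate the idea; it is not plyr's implementation and ignores details such as factors and empty data frames.

```r
# Hypothetical helper illustrating the idea; not plyr's implementation
rbind_fill_sketch <- function(dfs) {
  # Union of column names across all inputs, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA                       # pad absent columns with NA
    }
    d[, all_cols, drop = FALSE]            # same column order everywhere
  })
  do.call(rbind, padded)
}

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 2, z = 1)

out <- rbind_fill_sketch(list(a, b))
out
names(a)   # still just "x" and "y": the inputs are not modified
```

Padding copies of the inputs, rather than the inputs themselves, is what leaves a without a z column afterwards.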

5 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/32285145
