
Q: Looping over data.frame columns to generate a dummy variable in R

Stack Overflow user

Asked 2021-01-16 00:04:30

1 answer · 425 views · 0 followers · 2 votes

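I am struggling to generate a variable for my current project. I am using R version 4.0.1 on Windows.

Data description

I have unbalanced panel data in a data.table with 243 variables (before running the commands) and 8278 observations. The data is uniquely identified by id and period. In columns 69:135 I have different region dummies (2 = yes, the company operates in the region; 1 = no, the company does not operate in the region), and in columns 178:244 the lagged versions of those variables, derived from columns 69:135 grouped by id.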

library(data.table)

# small reproducible sample: 18 rows, 5 companies
dat <- data.table(
  id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
  period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
  region1 = as.factor(c(NA, NA, 2, 1, NA, 1, 2, 2, 1, NA, 1, rep(NA, 7))),
  region2 = as.factor(c(1, 2, 1, 1, NA, NA, 2, 1, 2, 1, 1, rep(NA, 7))),
  industry = as.factor(c(rep("Finance", 3), "Culture", rep("Nutrition", 5), rep("Finance", 2), rep("Medicine", 7))),
  number_employees = as.numeric(c(10, 10, 12, 2, 2, 4, 4, 4, 4, 18, 25, 100, 110, 108, 108, 120, 120, 120)),
  lag_region1 = as.factor(c(rep(NA, 6), 1, 2, 2, rep(NA, 9))),
  lag_region2 = as.factor(c(NA, 1, 2, rep(NA, 4), 2, 1, NA, 1, rep(NA, 7)))
)


#this gives (last 8 rows are not printed):
#      id period region1 region2  industry number_employees lag_region1 lag_region2
# 1: C001      1    <NA>       1   Finance               10        <NA>        <NA>
# 2: C001      2    <NA>       2   Finance               10        <NA>           1
# 3: C001      3       2       1   Finance               12        <NA>           2
# 4: C002      2       1       1   Culture                2        <NA>        <NA>
# 5: C003      1    <NA>    <NA> Nutrition                2        <NA>        <NA>
# 6: C003      4       1    <NA> Nutrition                4        <NA>        <NA>
# 7: C003      5       2       2 Nutrition                4           1        <NA>
# 8: C003      6       2       1 Nutrition                4           2           2
# 9: C003     10       1       2 Nutrition                4           2           1
#10: C004      3    <NA>       1   Finance               18        <NA>        <NA>

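Desired result

I want to generate a new dummy variable left_region that equals "yes" when the company has left at least one region in the respective period. My idea is to "compare" column 69 with column 178, column 70 with column 179, column 71 with column 180, and so on. left_region should be set to "yes" if, for example, dt[, 69] == 1 & dt[, 178] == 2 (that is, left_region equals "yes" when a company leaves a region it previously operated in). The desired result looks like this: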

# desired result (last 8 rows are not printed):
#      id period region1 region2  industry number_employees lag_region1 lag_region2 left_region
# 1: C001      1    <NA>       1   Finance               10        <NA>        <NA>          no
# 2: C001      2    <NA>       2   Finance               10        <NA>           1          no
# 3: C001      3       2       1   Finance               12        <NA>           2         yes
# 4: C002      2       1       1   Culture                2        <NA>        <NA>          no
# 5: C003      1    <NA>    <NA> Nutrition                2        <NA>        <NA>          no
# 6: C003      4       1    <NA> Nutrition                4        <NA>        <NA>          no
# 7: C003      5       2       2 Nutrition                4           1        <NA>          no
# 8: C003      6       2       1 Nutrition                4           2           2         yes
# 9: C003     10       1       2 Nutrition                4           2           1         yes
#10: C004      3    <NA>       1   Finance               18        <NA>        <NA>          no

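Problem description

However, I am struggling to make this work across all columns at once. I used ifelse() inside a for loop. To do that, I first had to convert my data.table to a data.frame.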

# generate empty cells
df <- data.frame(matrix(NA, nrow = 8278, ncol = 67))
# combine prior data.table and new data.frame in large data.frame (with data.table the following loop does not work)
dt <- as.data.frame(cbind(dt, df))

# loop through 67 columns comparing 69 to 178, 70 to 179, etc.
for (i in 69:135) {
 dt[, i + 176] <- ifelse(is.na(dt[, i]) & is.na(dt[, (i + 109)]), NA,
         ifelse(dt[, i] == 1 & dt[, (i + 109)] == 2, "yes", "no"
         )
  )
}

# generate final dummy variable left_region --> there is some error here
dt$left_region <-
  ifelse(any(dt[, c(245:311)] == "yes"), "yes", "no")

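However, running the last ifelse() combined with any() produces a left_region that contains only "yes" for every one of the 8278 observations.

I tested the behaviour of that last ifelse() command on a single observation.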
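A likely reason for the all-"yes" column is that any() reduces its entire input to one logical value, so ifelse() returns a single string that is then recycled into every row. The sketch below shows this on a hypothetical three-row data frame (the names flags, collapsed and per_row are illustrative, not from the original data) and contrasts it with a row-wise test.

```r
# Hypothetical toy data standing in for the "yes"/"no" helper columns 245:311
flags <- data.frame(a = c("no", "yes", "no"),
                    b = c("no", "no", "no"),
                    stringsAsFactors = FALSE)

# any() collapses the whole comparison matrix into one TRUE/FALSE
collapsed <- ifelse(any(flags == "yes"), "yes", "no")
length(collapsed)  # a single value, which assignment recycles into every row

# rowSums() counts the matches per row, so the result has one entry per row
per_row <- ifelse(rowSums(flags == "yes") > 0, "yes", "no")
print(per_row)
```

Only the second row contains a "yes", so only the row-wise version distinguishes it from the other two.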

#take out one observation
one_row <- dt[7, ]

library(dplyr)
# generate left_region for one observation only
new <- 
  one_row %>%
  mutate(left_region = ifelse(any(one_row[, c(245:311)] == "yes"), "yes", "no"))

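The selected observation should produce left_region == "no", but the opposite happens here. Somehow the last argument of ifelse(), "no", does not seem to be picked up by R.

Besides not being a "pretty" solution, wrapping ifelse() and any() in a for() loop does not solve the problem either. In that case left_region is "yes" in only 270 cases, but it is never "no".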

for (i in 1:nrow(dt)) {
  dt$left_region[i] <-
    ifelse(any(dt[i, c(245:311)] == "yes"), "yes", "no")
}
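A plausible explanation for never seeing "no" in the row-wise loop is missing values: when a row holds no "yes" but at least one NA, any() returns NA rather than FALSE, and ifelse() passes that NA through. The sketch below uses a hypothetical vector row_flags in place of one row of the helper columns; whether this is what happens in the full data set is an assumption.

```r
# Hypothetical single row of comparison results: no "yes", but one NA
row_flags <- c("no", NA, "no")

# any(c(FALSE, NA, FALSE)) is NA, so ifelse() returns NA instead of "no"
res_na <- ifelse(any(row_flags == "yes"), "yes", "no")

# na.rm = TRUE drops the NA before testing, so the result is "no"
res_ok <- ifelse(any(row_flags == "yes", na.rm = TRUE), "yes", "no")

print(res_na)
print(res_ok)
```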

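Does anyone know why this happens? What do I need to do to get the result I want? Any ideas are much appreciated!

I very much hope I have explained everything in an understandable way. Many thanks in advance!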

loops

if-statement

panel-data

1 Answer

Stack Overflow user

Accepted answer

Posted 2021-01-16 04:42:29

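dt[, 69:135] == 1 returns TRUE where the value in columns 69:135 is 1, and FALSE otherwise.

dt[, 178:244] == 2 returns TRUE where the value in columns 178:244 is 2, and FALSE otherwise.

You can apply the AND (&) operator between the two results to compare them element-wise, i.e. dt[, 69] & dt[, 178], dt[, 70] & dt[, 179], and so on. Take the row-wise sum of the result and mark the row 'yes' if even a single TRUE is found in it.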
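The intermediate steps can be sketched on a small made-up example. The data frames cur and prev and their values are hypothetical stand-ins for columns 69:135 and 178:244.

```r
# Hypothetical toy data: two current-region columns and their lagged versions
cur  <- data.frame(region1 = c(1, 2, 1), region2 = c(2, 1, 2))
prev <- data.frame(lag_region1 = c(2, 2, 1), lag_region2 = c(2, 2, 1))

left_now <- cur == 1           # logical matrix: not in the region now
was_in   <- prev == 2          # logical matrix: was in the region before
left     <- left_now & was_in  # element-wise AND, matched by column position

n_left <- rowSums(left)        # number of regions left, per row
print(left)
print(n_left)
```

Because the two matrices are paired by position, the current and lagged columns must be in the same order.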

dt$left_region <- ifelse(rowSums(dt[, 69:135] == 1 & dt[, 178:244] == 2) > 0, 'yes', 'no')
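One caveat: rowSums() returns NA for any row that contains an NA unless na.rm = TRUE is passed, and the sample data in the question has many missing values. The following is a minimal sketch, not part of the original answer, that applies the same idea to the region columns of the question's sample data with na.rm = TRUE added. It assumes plain numeric columns in a base data.frame and selects the columns by name rather than by position.

```r
# Region columns from the question's sample data, as a plain data.frame
dat <- data.frame(
  region1     = c(NA, NA, 2, 1, NA, 1, 2, 2, 1, NA, 1, rep(NA, 7)),
  region2     = c(1, 2, 1, 1, NA, NA, 2, 1, 2, 1, 1, rep(NA, 7)),
  lag_region1 = c(rep(NA, 6), 1, 2, 2, rep(NA, 9)),
  lag_region2 = c(NA, 1, 2, rep(NA, 4), 2, 1, NA, 1, rep(NA, 7))
)

cur_cols <- c("region1", "region2")          # stand-in for columns 69:135
lag_cols <- c("lag_region1", "lag_region2")  # stand-in for columns 178:244

# TRUE where the firm is out of a region now (1) but was in it before (2)
left <- dat[, cur_cols] == 1 & dat[, lag_cols] == 2

# na.rm = TRUE treats undetermined comparisons as "did not leave"
dat$left_region <- ifelse(rowSums(left, na.rm = TRUE) > 0, "yes", "no")
print(dat)
```

Treating NA as "no" matches the desired output shown in the question; if missing comparisons should stay missing instead, leave out na.rm = TRUE.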

1 vote

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/65745117
