
Q: Counting clustered values in a csv

Stack Overflow user

Asked 2015-04-02 04:49:20


I have a csv file whose rows contain a name followed by a series of empty values and clustered real values.

Robert,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,2:00-4:00
John,,,1:00-5:00,1:00-5:00,,,,,,,,,,,,
Casey,,,1:00-5:00,1:00-5:00,1:00-5:00,,,,,,2:00-4:00,2:00-4:00,,,
Sarah,,,1:00-5:00,,,,,,,,2:00-4:00,2:00-4:00,2:00-4:00,,

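I would like to write a script in R that counts the clusters. If a row contains three real values in sequence, I want to count them as "one" cluster. If there is a cluster of fewer than three (i.e. one or two sequential values), I want to count it as "one" individual cluster, tallied separately.

Desired output in csv format: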

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1

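Edit from the comments

The csv that the code imports does have a header, but I would like the code to ignore the header and start reading from the first data row (i.e. Robert,,,1:00-5:00,...). I would also like to ignore the last column of the imported csv file, which contains the total number of hours each person worked. Here is a link to a github with a sample csv: report.csv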

Employee,"Mar 23, 2015","Mar 24, 2015","Mar 25, 2015","Mar 26, 2015","Mar 27, 2015","Mar 28, 2015","Mar 29, 2015",total hours
"John Smith",16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,,,,11:17 - 16:08 / 4.85,18.9569
"Emily Smith",,,,,,08:13 - 12:40 / 4.45,,4.4472222222222
"Robert Jenkins",16:54 - 21:11 / 4.29,16:54 - 21:11 / 4.29,,,16:22 - 22:59 / 6.61,,,15.18638
"Rachel Lipscomb",,,,,,13:18 - 19:04 / 5.76,,5.7638888888889
"Donald Driver",,,,,08:13 - 13:05 / 4.86,08:13 - 13:05 / 4.86,10:02 - 16:02 / 6,15.14694

csv

count

gaps-and-islands

1 Answer

Stack Overflow user

Answered 2019-06-04 17:15:41

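Here is a possible data.table solution to this old question. It uses

fread() to read the input file,
melt() / dcast() for reshaping,
and the rleid() function to identify the gaps and islands.

For the dataset posted in the question, the following code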
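To illustrate how rleid() separates gaps from islands, here is a minimal sketch on a single made-up row. The vector shifts is hypothetical and not taken from the input file. rleid() starts a new id each time its input changes, so applying it to is.na() gives every run of consecutive filled cells its own id.

```r
library(data.table)

# Hypothetical row: NA marks an empty cell, strings mark worked shifts
shifts <- c(NA, NA, "1:00-5:00", "1:00-5:00", "1:00-5:00", NA,
            "2:00-4:00", "2:00-4:00")

# A new run id starts whenever is.na() flips between TRUE and FALSE
run_id <- rleid(is.na(shifts))
run_id
# 1 1 2 2 2 3 4 4

# Size of each island, i.e. of each run of non-missing cells
island_sizes <- as.vector(table(run_id[!is.na(shifts)]))
island_sizes
# 3 2

# Number of islands with three or more cells and with fewer than three
c(three_or_more = sum(island_sizes >= 3),
  fewer_than_three = sum(island_sizes < 3))
```

Because the ids are built from is.na() rather than from the cell contents, two adjacent shifts with different times still belong to the same island.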

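library(data.table)
library(magrittr)

# read without a header; empty fields become NA, short rows are padded
fread("input.csv", header = FALSE, na.strings = c(""), fill = TRUE) %>% 
  .[, V1 := forcats::fct_inorder(V1)] %>%  # to keep the original order in dcast() below
  melt(id.vars = "V1") %>%  # long format: one row per name and column
  setorder(V1, variable) %>% 
  # new id whenever the name or the NA status changes
  .[, cluster.id := rleid(V1, is.na(value))] %>%
  # size of each island of non-missing values
  .[!is.na(value), .N, by = .(V1, cluster.id)] %>% 
  # count the islands per name, split by size class (N < 3 or not)
  dcast(V1 ~ N < 3, length, value.var = "N") %>% 
  fwrite("output.csv", col.names = FALSE)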

creates the csv file as requested:

Robert,2,0
John,0,1
Casey,1,1
Sarah,1,1

In a comment, the OP provided a link to another sample dataset on github.

With some slight modifications,

fread("https://raw.githubusercontent.com/agrobins/r_IslandCount/test_files/timeclock_report.csv"
      , drop = "total hours", na.strings = c("")) %>% 
  .[, Employee := forcats::fct_inorder(Employee)] %>%  # to keep the original order in dcast() below
  melt(id.vars = "Employee") %>% 
  setorder(Employee, variable) %>% 
  .[, cluster.id := rleid(Employee, is.na(value))] %>% 
  .[!is.na(value), .N, by = .(Employee, cluster.id)] %>% 
  dcast(Employee ~ N < 3, length, value.var = "N")

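we get

          Employee FALSE TRUE
1:      John Smith     1    1
2:     Emily Smith     0    1
3:  Robert Jenkins     0    2
4: Rachel Lipscomb     0    1
5:   Donald Driver     1    0

The first numeric column, named FALSE, contains the number of clusters made up of three or more consecutive entries, while the second numeric column, named TRUE, contains the number of clusters made up of one or two consecutive entries. The column names are the two possible values of the expression N < 3 in the dcast() formula.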
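As a cross-check that does not depend on data.table, the same counting rule can be sketched in base R with rle(). The helper name count_islands is hypothetical and not part of the original answer. It expects the cells of one row, with NA for empty cells.

```r
# Hypothetical helper: count the islands of non-missing cells in one row
count_islands <- function(cells) {
  runs <- rle(!is.na(cells))          # runs of filled (TRUE) and empty (FALSE) cells
  sizes <- runs$lengths[runs$values]  # keep only the lengths of the filled runs
  c(three_or_more = sum(sizes >= 3), fewer_than_three = sum(sizes < 3))
}

# John Smith's week from report.csv: three shifts, three days off, one shift
john <- c("16:35 - 21:17 / 4.7", "16:35 - 21:17 / 4.7", "16:35 - 21:17 / 4.7",
          NA, NA, NA, "11:17 - 16:08 / 4.85")
count_islands(john)
```

For this row the helper reports one cluster of three or more entries and one smaller cluster, which agrees with the FALSE and TRUE values shown for John Smith.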

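Reproducible data

Because links to external websites are fragile, here are the contents of report.csv:

Employee,"Mar 23, 2015","Mar 24, 2015","Mar 25, 2015","Mar 26, 2015","Mar 27, 2015","Mar 28, 2015","Mar 29, 2015",total hours
"John Smith",16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,16:35 - 21:17 / 4.7,,,,11:17 - 16:08 / 4.85,18.9569
"Emily Smith",,,,,,08:13 - 12:40 / 4.45,,4.4472222222222
"Robert Jenkins",16:54 - 21:11 / 4.29,16:54 - 21:11 / 4.29,,,16:22 - 22:59 / 6.61,,,15.18638
"Rachel Lipscomb",,,,,,13:18 - 19:04 / 5.76,,5.7638888888889
"Donald Driver",,,,,08:13 - 13:05 / 4.86,08:13 - 13:05 / 4.86,10:02 - 16:02 / 6,15.14694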


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/29405159
