Q: dplyr: How to pivot_wider using multiple columns while also creating new covariates?

Stack Overflow user

Asked 2021-07-20 18:08:52

Answers: 2 · Views: 204 · Followers: 0 · Votes: 0

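I have a proteomics dataset that currently has 60 columns (patients, plus information such as protein names) and 1800 rows (specific proteins).

I need to convert it from long to wide format so that each row corresponds to a patient and the columns represent the proteins. I can do (very) simple conversions, but this example has many columns, and some data management is needed during the reshaping because new covariates have to be created/extracted from the raw proteomics output below. I simply do not know how to proceed, and I have not found a solution in the many available walkthroughs on reshaping large datasets like this one.

I would prefer dplyr input, hints, or solutions.

The raw output from the proteomics software looks like this: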

> head(Heat_BT)
# A tibble: 11 x 6
   protein                                        gene   Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
   <chr>                                          <chr>  <chr>        <chr>        <chr>          <chr>         
 1 NA                                             NA     Bruschi      Bruschi      Reichl         Reichl        
 2 NA                                             NA     Ctrl         Ctrl         Tumor          Ctrl          
 3 NA                                             NA     Hydro        Hydro        Malignant      Hydro         
 4 NA                                             NA     Ctrl         Ctrl         MB             Ctrl          
 5 von Willebrand factor                          VWF    0.674627721  0.255166769  0.970489979    0.215972215   
 6 Sex hormone-binding globulin                   SHBG   0.516914487  0.476843655  0.88173753     0.306484252   
 7 Glyceraldehyde-3-phosphate dehydrogenase       GAPDH  0.622163594  0.231107563  0.71856463     0.204625234   
 8 Nestin                                         NES    0.868476391  0.547319174  0.832109928    0.440162212   
 9 Heat shock 70 kDa protein 13                   HSPA13 0.484973907  0.435322136  0.539334834    0.28678757    
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2   1.017596364  0.107395157  0.710225344    0.251976997   
11 Mannan-binding lectin serine protease 1        MASP1  0.491321206  0.434995681  0.812500775    0.403583705

Expected output:

              id     lab malig      diag       VWF      SHBG     GAPDH       NES    HSPA13      IDH2     MASP1
1   Intensity_10 Bruschi  Ctrl     Hydro 0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2   Intensity_11 Bruschi  Ctrl     Hydro 0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1  Reichl Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1  Reichl  Ctrl     Hydro 0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837

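The proteomics software automatically prints the first four rows as the categories each patient belongs to.

Based on the first four rows, four new covariates must be added in the wide format: (1) Heat_BT$id, corresponding to the study name of each patient; (2) Heat_BT$lab, corresponding to the lab that produced the data; (3) Heat_BT$malig, corresponding to whether the patient case is a control or a tumor case; and finally (4) Heat_BT$diag, corresponding to the underlying diagnosis.

Data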

Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor", 
                           "Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase", 
                           "Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial", 
                           "Mannan-binding lectin serine protease 1"), gene = c(NA, NA, 
                                                                                NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
                           ), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", 
                                               "0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364", 
                                               "0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro", 
                                                                                "Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174", 
                                                                                "0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl", 
                                                                                                                                                 "Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463", 
                                                                                                                                                 "0.832109928", "0.539334834", "0.710225344", "0.812500775"), 
               Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215", 
                                  "0.306484252", "0.204625234", "0.440162212", "0.28678757", 
                                  "0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df", 
                                                                                                     "tbl", "data.frame"))

pivot

dataframe

dplyr

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-07-20 19:06:42

Here is a dplyr solution. It takes two steps, because you first need to gather the intensity variables.

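library(dplyr)
library(tidyr)

# Drop the four annotation rows (the only rows that contain NA)
Heat_BT <- Heat_BT %>% na.exclude()

# Drop the protein column, go long by sample, then wide by gene
Heat_BT[,-1] %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1,
        names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        # Intensities were read as character; convert them to numeric
        across(.cols = -"id", as.numeric)
)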

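This gives the following output (the printout shows the intensities as <chr>; after the as.numeric() step they are <dbl>):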

# A tibble: 4 x 8
  id             VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1      
  <chr>          <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 Intensity_10   0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705

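I have a hard time seeing the connection between the data and the variables you want to add, so I assume that once you are able to pivot your data correctly, you will be able to fill in the rest.

I am happy to revise my answer if you can explain the relationship between these variables more clearly.

Best

Edit: Note that I removed the first four rows from the data, because I did not immediately see the connection between them and the variables you want to add.

Edit 2: I assume the first three rows are the covariates you want to add, so that they are lab, malig, and diag, respectively.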

# Extract the relevant information
# from the data. Start from the original Heat_BT,
# before any rows have been removed.
id_cols <- bind_cols(
        var = c("lab", "malig", "diag"),
        Heat_BT[1:3,-c(1,2)] 
) %>% group_by(var) %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
        names_from = var
)


# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude() 

# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
        cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        across(.cols = -"id", as.numeric)
)

# Join with the ID columns
left_join(
        id_cols,
        pivoted_table,
        by = "id"
)

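This gives the following output (the printout shows the intensities as <chr>; after the as.numeric() step they are <dbl>):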

# A tibble: 4 x 11
  id             lab     malig diag      VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1      
  <chr>          <chr>   <chr> <chr>     <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 Intensity_10   Bruschi Ctrl  Hydro     0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   Bruschi Ctrl  Hydro     0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl  Tumor Malignant 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl  Ctrl  Hydro     0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705

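This will work with your data regardless of its size. You can make the approach more bulletproof by replacing cols = Intensity_10:Intensity_Ref1 with contains("intensity"), which matches case-insensitively by default.

Edit 3

You have many more variables than are provided here, and when you pivot, those variables are not modified during the pivot process.

We can therefore take a more robust approach, assuming that all variables not shown here are similar to those shown, by changing the cols argument accordingly.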
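As a minimal sketch of that substitution: the `toy` tibble below is a hypothetical two-sample stand-in for Heat_BT (annotation rows already removed, values shortened), used only to show that `contains("intensity")` picks up every `Intensity_*` column without naming the first and last one.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real table: two proteins, two samples
toy <- tibble(
  gene           = c("VWF", "SHBG"),
  Intensity_10   = c("0.67", "0.52"),
  Intensity_MB_1 = c("0.97", "0.88")
)

wide <- toy %>%
  # contains() ignores case by default, so "intensity" matches "Intensity_*"
  pivot_longer(cols = contains("intensity"), names_to = "id") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  # every column except id holds an intensity stored as text
  mutate(across(-id, as.numeric))

wide
```

The selection only depends on the column names, so it should keep working if further `Intensity_*` columns are added; it would also pick up any unrelated column whose name happens to contain "intensity".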

# Extract the relevant information
# from the data. Start from the original Heat_BT,
# before any rows have been removed.
id_cols <- bind_cols(
        var = c("lab", "malig", "diag"),
        Heat_BT[1:3,-c(1,2)] 
) %>% group_by(var) %>% pivot_longer(
        cols = -"var", names_to = "id"
) %>% pivot_wider(
        names_from = var
)


# Remove these identifiers;
Heat_BT <- Heat_BT[-(1:4),]

# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
        cols = -"gene",
        names_to = "id"
) %>% pivot_wider(
        names_from = gene
) %>% mutate(
        across(.cols = -"id", as.numeric)
)

# Join with the ID columns
left_join(
        id_cols,
        pivoted_table,
        by = "id"
)

This gives the same output as above.

Votes: 1

Stack Overflow user

Posted 2021-07-20 18:34:23

You can do this:
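The same result can also be sketched as a single pipeline, by first writing the covariate names into the empty gene cells so that annotation rows and protein rows are pivoted together. This is a variation, not the approach used above; the `toy` tibble and the `covars` vector are hypothetical stand-ins, and it assumes (as in Edit 2) that rows 1 to 3 hold lab, malig, and diag.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same layout as Heat_BT (protein column
# already dropped): four annotation rows first, then the proteins
toy <- tibble(
  gene           = c(NA, NA, NA, NA, "VWF", "SHBG"),
  Intensity_10   = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.67", "0.52"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.97", "0.88")
)

covars <- c("lab", "malig", "diag")  # assumed meaning of rows 1-3

wide <- toy %>%
  mutate(gene = replace(gene, 1:3, covars)) %>%  # label the annotation rows
  filter(!is.na(gene)) %>%                       # drop the unused fourth row
  pivot_longer(cols = -gene, names_to = "id") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  # convert only the protein columns; covariates stay character
  mutate(across(-c(id, all_of(covars)), as.numeric))

wide
```

Because everything goes through one character column before widening, the numeric conversion has to come last and must skip the covariate columns.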

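# Label the first three annotation rows in the gene column
# (the $ form works for both data frames and tibbles)
Heat_BT$gene[1:3] <- c('lab', 'malig', 'diag')
# Transpose everything except the protein column; the gene column supplies
# the new column names and the old column names are kept in 'gene'
data.table::transpose(Heat_BT[,-1],keep.names = 'gene',make.names = TRUE)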

            gene     lab malig      diag   NA         VWF        SHBG       GAPDH         NES      HSPA13        IDH2       MASP1
1   Intensity_10 Bruschi  Ctrl     Hydro Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2   Intensity_11 Bruschi  Ctrl     Hydro Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1  Reichl Tumor Malignant   MB 0.970489979  0.88173753  0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1  Reichl  Ctrl     Hydro Ctrl 0.215972215 0.306484252 0.204625234 0.440162212  0.28678757 0.251976997 0.403583705
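The transposed result still carries the unlabeled fourth annotation row as a column named NA, and the intensities are still text. A possible base R cleanup is sketched below; `res` is a hand-built stand-in for the output above (two samples, two proteins), and renaming the sample column to `id` is an assumption taken from the expected output in the question.

```r
# Hand-built stand-in for the transposed result shown above
res <- data.frame(
  gene  = c("Intensity_10", "Intensity_MB_1"),
  lab   = c("Bruschi", "Reichl"),
  malig = c("Ctrl", "Tumor"),
  diag  = c("Hydro", "Malignant"),
  `NA`  = c("Ctrl", "MB"),
  VWF   = c("0.674627721", "0.970489979"),
  SHBG  = c("0.516914487", "0.88173753"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Drop the column that came from the unlabeled fourth annotation row
keep <- !is.na(names(res)) & names(res) != "NA"
res  <- res[, keep]

# Rename the sample column and turn numeric-looking strings into numbers;
# as.is = TRUE keeps the remaining text columns as character, not factor
names(res)[names(res) == "gene"] <- "id"
res <- type.convert(res, as.is = TRUE)

str(res)
```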

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/68459534
