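I have a proteomics dataset that currently has 60 columns (patients, plus information such as the protein name) and 1800 rows (individual proteins).
I need to convert it from long to wide format, so that each row corresponds to a patient and the columns represent the proteins. I can do (very) simple conversions, but this case has many columns, and the reshaping needs some data management along the way, because new covariates have to be created/extracted from the raw proteomics output shown below. I do not know how to proceed, and none of the many walkthroughs I found covers reshaping a large dataset like this one.
I would prefer dplyr-based input, hints or solutions.
The raw output from the proteomics software looks like this: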
> head(Heat_BT)
# A tibble: 11 x 6
   protein                                        gene   Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
   <chr>                                          <chr>  <chr>        <chr>        <chr>          <chr>
 1 NA                                             NA     Bruschi      Bruschi      Reichl         Reichl
 2 NA                                             NA     Ctrl         Ctrl         Tumor          Ctrl
 3 NA                                             NA     Hydro        Hydro        Malignant      Hydro
 4 NA                                             NA     Ctrl         Ctrl         MB             Ctrl
 5 von Willebrand factor                          VWF    0.674627721  0.255166769  0.970489979    0.215972215
 6 Sex hormone-binding globulin                   SHBG   0.516914487  0.476843655  0.88173753     0.306484252
 7 Glyceraldehyde-3-phosphate dehydrogenase       GAPDH  0.622163594  0.231107563  0.71856463     0.204625234
 8 Nestin                                         NES    0.868476391  0.547319174  0.832109928    0.440162212
 9 Heat shock 70 kDa protein 13                   HSPA13 0.484973907  0.435322136  0.539334834    0.28678757
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2   1.017596364  0.107395157  0.710225344    0.251976997
11 Mannan-binding lectin serine protease 1        MASP1  0.491321206  0.434995681  0.812500775    0.403583705
Expected output:
  id             lab     malig diag      VWF       SHBG      GAPDH     NES       HSPA13    IDH2      MASP1
1 Intensity_10   Bruschi Ctrl  Hydro     0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2 Intensity_11   Bruschi Ctrl  Hydro     0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1 Reichl  Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1 Reichl  Ctrl  Hydro     0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837
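According to the first four rows of the proteomics software output: (1) Heat_BT$id corresponds to the study name of each patient, (2) Heat_BT$lab corresponds to the lab that produced the data, (3) Heat_BT$malig indicates whether the patient case is a control or a tumor case, and finally (4) Heat_BT$diag corresponds to the underlying diagnosis.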
Data
Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor",
"Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase",
"Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial",
"Mannan-binding lectin serine protease 1"), gene = c(NA, NA,
NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721",
"0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364",
"0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro",
"Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174",
"0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl",
"Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463",
"0.832109928", "0.539334834", "0.710225344", "0.812500775"),
Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215",
"0.306484252", "0.204625234", "0.440162212", "0.28678757",
"0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
Answered on 2021-07-20 19:06:42
Here is a dplyr solution (the pivot functions come from tidyr). It takes two steps, because you first need to gather the intensity variables.
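library(dplyr)
library(tidyr)

# Drop the four covariate rows (their protein and gene entries are NA)
Heat_BT <- Heat_BT %>% na.exclude()

# Gather the intensity columns into long format, then spread the genes into columns
Heat_BT[,-1] %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1,
  names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)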
which gives the following output (the column types are printed as <chr>, so this printout appears to predate the as.numeric() step; with that step the protein columns are <dbl>):
# A tibble: 4 x 8
  id             VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1
  <chr>          <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>
1 Intensity_10   0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705
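I have a hard time seeing the connection between the variables you want to add and the data, so I assume that once you can pivot your data properly you will be able to fill in the rest.
I am happy to revise my answer if you can explain the relationship between these variables more clearly.
Best
Edit: Note that I removed the first four rows from the data, because I did not immediately see the connection to the variables you want to add.
Edit 2: I assume the first 3 rows are the covariates you want to add, so that these rows are lab, malig and diag, respectively.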
# Extract the relevant information
# from the data. Run this on the original Heat_BT,
# i.e. before the covariate rows are dropped.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3,-c(1,2)]
) %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
  names_from = var
)
# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude()
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
  cols = Intensity_10:Intensity_Ref1, names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)
# Join with the ID columns; naming the key avoids the "Joining, by" message
left_join(
  id_cols,
  pivoted_table,
  by = "id"
)
This gives the output below (column types as originally printed; after the as.numeric() step the protein columns are <dbl>):
# A tibble: 4 x 11
  id             lab     malig diag      VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1
  <chr>          <chr>   <chr> <chr>     <chr>       <chr>       <chr>       <chr>       <chr>       <chr>       <chr>
1 Intensity_10   Bruschi Ctrl  Hydro     0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   Bruschi Ctrl  Hydro     0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl  Tumor Malignant 0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl  Ctrl  Hydro     0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705
This will work with the data you have, regardless of its size. You can make the approach more bulletproof by replacing cols = Intensity_10:Intensity_Ref1 with contains("intensity").
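A minimal sketch of that substitution, assuming every sample column shares the `Intensity_` prefix. The `toy` table and its rounded values are made up for illustration. `starts_with()` is used here because it is stricter than `contains()`, which matches case-insensitively by default and would also select any other column with "intensity" anywhere in its name.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the protein rows (hypothetical values, two samples only).
toy <- tibble(
  gene = c("VWF", "SHBG"),
  Intensity_10 = c("0.67", "0.52"),
  Intensity_MB_1 = c("0.97", "0.88")
)

# Select the sample columns by prefix instead of by first:last position.
long <- toy %>%
  pivot_longer(
    cols = starts_with("Intensity_"),
    names_to = "id"
  )

# One row per sample, one column per gene, values converted to numeric.
wide <- long %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  mutate(across(-id, as.numeric))
```

Selecting by prefix keeps working when samples are added or reordered, whereas a `first:last` range silently depends on column order.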
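Edit 3
You have many more variables than the ones provided here, and a column range written for this example will not cover them when you pivot.
We can therefore take a more robust approach, assuming that all variables not provided here are similar to the ones provided, by changing the cols argument accordingly.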
# Extract the relevant information
# from the data.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3,-c(1,2)]
) %>% pivot_longer(
  cols = -"var", names_to = "id"   # every column except the label column
) %>% pivot_wider(
  names_from = var
)
# Remove these identifiers;
Heat_BT <- Heat_BT[-(1:4),]
# Pivot the table;
pivoted_table <- Heat_BT[,-1] %>% pivot_longer(
  cols = -"gene",                  # every column except the gene symbol
  names_to = "id"
) %>% pivot_wider(
  names_from = gene
) %>% mutate(
  across(.cols = -"id", as.numeric)
)
# Join with the ID columns; naming the key avoids the "Joining, by" message
left_join(
  id_cols,
  pivoted_table,
  by = "id"
)
This gives the same output as above.
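The steps above can be wrapped in one function so they can be rerun on the full dataset. This is only a sketch: `reshape_proteomics()`, its `covariates` and `n_header` arguments, and the two-sample `raw` table are hypothetical names introduced here. It assumes the first three rows hold lab, malig and diag, that four header rows precede the proteins, and that every column after `protein` and `gene` is one sample.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: the row layout is an assumption taken from the example data.
reshape_proteomics <- function(raw,
                               covariates = c("lab", "malig", "diag"),
                               n_header = 4) {
  # Covariate rows: label them, go long, then spread the labels into columns.
  id_cols <- raw[seq_along(covariates), ] %>%
    select(-protein, -gene) %>%
    mutate(var = covariates) %>%
    pivot_longer(cols = -var, names_to = "id") %>%
    pivot_wider(names_from = var, values_from = value)

  # Protein rows: drop the header rows, then one column per gene.
  proteins <- raw %>%
    slice(-seq_len(n_header)) %>%
    select(-protein) %>%
    pivot_longer(cols = -gene, names_to = "id") %>%
    pivot_wider(names_from = gene, values_from = value) %>%
    mutate(across(-id, as.numeric))

  left_join(id_cols, proteins, by = "id")
}

# Two samples and two proteins taken from the question's data.
raw <- tibble(
  protein = c(NA, NA, NA, NA, "von Willebrand factor", "Nestin"),
  gene = c(NA, NA, NA, NA, "VWF", "NES"),
  Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721", "0.868476391"),
  Intensity_MB_1 = c("Reichl", "Tumor", "Malignant", "MB", "0.970489979", "0.832109928")
)

result <- reshape_proteomics(raw)
```

Dropping the header rows by position rather than by `NA` keeps any protein whose gene symbol happens to be missing, though such a protein would still need a usable column name before pivoting.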
Answered on 2021-07-20 18:34:23
You can do this:
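# Label the three covariate rows in the gene column; `$` works for tibbles and data frames alike
Heat_BT$gene[1:3] <- c('lab', 'malig', 'diag')
# Transpose everything except `protein`; the gene column supplies the new column names
data.table::transpose(Heat_BT[,-1], keep.names = 'gene', make.names = 'gene')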
  gene           lab     malig diag      NA   VWF         SHBG        GAPDH       NES         HSPA13      IDH2        MASP1
1 Intensity_10   Bruschi Ctrl  Hydro     Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11   Bruschi Ctrl  Hydro     Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl  Tumor Malignant MB   0.970489979 0.88173753  0.71856463  0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl  Ctrl  Hydro     Ctrl 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757  0.251976997 0.403583705
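In the transposed result every column is still character, the sample column is called `gene`, and the unused fourth row survives as a column named NA. A possible clean-up in base R is sketched below. The `wide` data frame is a hand-built toy shaped like the output above (two samples, two proteins), and the covariate names are assumed to be exactly lab, malig and diag.

```r
# Toy data frame mimicking the transposed output (values copied from the question).
wide <- data.frame(
  gene = c("Intensity_10", "Intensity_MB_1"),
  lab = c("Bruschi", "Reichl"),
  malig = c("Ctrl", "Tumor"),
  diag = c("Hydro", "Malignant"),
  `NA` = c("Ctrl", "MB"),
  VWF = c("0.674627721", "0.970489979"),
  NES = c("0.868476391", "0.832109928"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the unused column, whether its name is a missing value or the string "NA".
keep <- !is.na(names(wide)) & names(wide) != "NA"
wide <- wide[, keep, drop = FALSE]

# Rename the sample column to match the expected output.
names(wide)[names(wide) == "gene"] <- "id"

# Convert everything that is not a covariate to numeric.
covariates <- c("id", "lab", "malig", "diag")
protein_cols <- setdiff(names(wide), covariates)
wide[protein_cols] <- lapply(wide[protein_cols], as.numeric)
```

Listing the covariates explicitly, rather than converting by position, keeps the conversion correct if more covariate rows are labelled later.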
https://stackoverflow.com/questions/68459534