一文看懂PCA主成分分析中介绍了PCA分析的原理和分析的意义(基本简介如下,更多见博客),今天就用数据来实际操练一下。
在公众号后台回复“PCA实战”,获取测试数据。
# 加载需要用到的R包library(psych)
library(reshape2)
library(ggplot2)
library(factoextra)
# 基因表达数据
exprData <- "ehbio_salmon.DESeq2.normalized.symbol.txt"
# 非必须
sampleFile <- "sampleFile"
# 为了保证文章的使用,文末附有数据的新下载链接,以防原链接失效
data <- read.table(exprData, header=T, row.names=NULL,sep="\t")
# 处理重复名字,谨慎处理,先找到名字重复的原因再决定是否需要按一下方式都保留
rownames_data <- make.names(data[,1],unique=T)
data <- data[,-1,drop=F]
rownames(data) <- rownames_data
data <- data[rowSums(data)>0,]
# 去掉方差为0 的行,这些本身没有意义,也妨碍后续运算
data <- data[apply(data, 1, var)!=0,]
# 计算中值绝对偏差 (MAD, median absolute deviation)度量基因表达变化幅度
# 在基因表达中,尽管某些基因很小的变化会导致重要的生物学意义,
# 但是很小的观察值会引入很大的背景噪音,因此也意义不大。
# 可以选择根据mad值过滤,取top 50, 500等做分析
mads <- apply(data, 1, mad)
data <- data[rev(order(mads)),]
# 此处未选
# data <- data[1:500,]
dim(data)
# Pay attention to the format of PCA input
# Rows are samples and columns are variables
data_t <- t(data)
variableL <- ncol(data_t)
if(sampleFile != "") {
sample <- read.table(sampleFile,header = T, row.names=1,sep="\t")
data_t_m <- merge(data_t, sample, by=0)
rownames(data_t_m) <- data_t_m$Row.names
data_t <- data_t_m[,-1]
}
# By default, prcomp will centralized the data using mean.
# Normalize data for PCA by dividing each data by column standard deviation.
# Often, we would normalize data.
# Only when we care about the real number changes other than the trends,
# `scale` can be set to TRUE.
# We will show the differences of scaling and un-scaling effects.
pca <- prcomp(data_t[,1:variableL], scale=T)
# sdev: standard deviation of the principle components.
# Square to get variance
# percentVar <- pca$sdev^2 / sum( pca$sdev^2)
# To check what's in pca
print(str(pca))
# PCA结果提取和可视化神器
# http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
library(factoextra)
1. 碎石图展示每个主成分的贡献
# 如果需要保持,去掉下面第1,3行的注释
#pdf("1.pdf")
fviz_eig(pca, addlabels = TRUE)
#dev.off()
2. PCA样品聚类信息展示
# repel=T,自动调整文本位置
fviz_pca_ind(pca, repel=T)
3. 根据样品分组上色
# 根据分组上色并绘制
fviz_pca_ind(pca, col.ind=data_t$conditions, mean.point=F, addEllipses = T, legend.title="Groups")
4. 增加不同属性的椭圆
# 根据分组上色并绘制95%置信区间
fviz_pca_ind(pca, col.ind=data_t$conditions, mean.point=F, addEllipses = T, legend.title="Groups", ellipse.type="confidence", ellipse.level=0.95)
5. 展示贡献最大的变量 (基因)
1) 展示与主坐标轴的相关性大于0.99的变量 (具体数字自己调整)
# Visualize variable with cos2 >= 0.99
fviz_pca_var(pca, select.var = list(cos2 = 0.99), repel=T, col.var = "cos2", geom.var = c("arrow", "text") )
2)展示与主坐标轴最相关的10个变量
# Top 10 active variables with the highest cos2
fviz_pca_var(pca, select.var= list(cos2 = 10), repel=T, col.var = "contrib")
3)展示自己关心的变量(基因)与主坐标轴的相关性分布
# Select by names
# 这里选择的是MAD值最大的几个基因
name <- list(name = c("FN1", "DCN", "CEMIP","CCDC80","IGFBP5","COL1A1","GREM1"))
fviz_pca_var(pca, select.var = name)
6. biPLot同时展示样本分组和关键基因
# top 5 contributing individuals and variable
fviz_pca_biplot(pca,
fill.ind=data_t$conditions,
palette="joo",
addEllipses = T,
ellipse.type="confidence",
ellipse.level=0.95,
mean.point=F,
col.var="contrib",
gradient.cols = "RdYlBu",
select.var = list(contrib = 10),
ggtheme = theme_minimal())
(1,2,3,4)
的方差是1.67
,而(10,20,30,40)
的方差就是167
,数据变大10倍,方差放大了100倍。