Proteomics analysis packages: learning DEqMS

james_wang

Published 2022-03-25 16:52:14


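When I first encountered proteomics data, its format looked no different from the gene expression matrices we usually get from sequencing. In fact, the differential protein analysis the company uses is the most basic t-test. But because proteins differ greatly from nucleic acids and the instrument workflows are also very different, I went searching online out of curiosity. The main R tools for proteomics analysis are currently the following three:

① limma; ② DEqMS; ③ DEP

This post focuses on DEqMS:

Official tutorial: http://www.bioconductor.org/packages/release/bioc/vignettes/DEqMS/inst/doc/DEqMS-package-vignette.html#download-and-read-the-input-protein-table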

1 Overview

DEqMS builds on top of Limma, a widely-used R package for microarray data analysis (Smyth G. et al 2004), and improves it with proteomics data specific properties, accounting for variance dependence on the number of quantified peptides or PSMs for statistical testing of differential protein expression.

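DEqMS is an R tool for proteomics analysis built on the limma package. The tutorial abstract defines PSMs as peptide spectrum matches, i.e. matches between peptides and spectra [1].

PSMs in theory: the number of spectra in which an identified peptide matches a theoretical digested peptide of a protein in the database (or, after an algorithm scores the similarity between the two, the highest-scoring theoretical peptide is taken as the identification), or the number of identified peptide sequences reported for a protein (including sequences identified more than once). The approach can therefore be used both qualitatively and quantitatively in proteomics, but the qualitative identification depends on what the database contains [2].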
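To make the counting concrete, here is a minimal base R sketch. The table `psm` and its columns `spectrum`, `peptide` and `protein` are made up for illustration and are not part of the tutorial data; the point is only that a PSM count includes repeated identifications, while a peptide count does not.

```r
# Hypothetical PSM-level table: one row per matched spectrum.
psm <- data.frame(
  spectrum = paste0("scan", 1:7),
  peptide  = c("AAK", "AAK", "LLER", "GGSK", "GGSK", "GGSK", "TTVR"),
  protein  = c("P1",  "P1",  "P1",   "P2",   "P2",   "P2",   "P3"),
  stringsAsFactors = FALSE
)

# PSM count per protein: every matched spectrum counts, repeats included.
psm_count <- vapply(split(psm$spectrum, psm$protein), length, integer(1))

# Distinct peptide sequences per protein, for comparison.
peptide_count <- vapply(split(psm$peptide, psm$protein),
                        function(p) length(unique(p)), integer(1))

print(psm_count)
print(peptide_count)
```

In this toy table P2 has three PSMs but only one distinct peptide, which is why the two counts should not be used interchangeably.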

2 Prepare the input protein data

2.1 Download the file, using TMT data as an example

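library(limma)    # lmFit, makeContrasts, contrasts.fit and eBayes used below
library(DEqMS)    # spectraCounteBayes, outputResult and the variance plots
url <- "https://ftp.ebi.ac.uk/pride-archive/2016/06/PXD004163/Yan_miR_Protein_table.flatprottable.txt"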

download.file(url, destfile = "./miR_Proteintable.txt",method = "auto")

df.prot = read.table("miR_Proteintable.txt",stringsAsFactors = FALSE, header = TRUE, quote = "", comment.char = "",sep = "\t")

2.2 Data filtering

# filter at 1% protein FDR and extract TMT quantifications
TMT_columns = seq(15,33,2)      # every other column from 15 to 33, i.e. the TMT quantification columns
dat = df.prot[df.prot$miR.FASP_q.value<0.01,TMT_columns]    # keep proteins with q-value below 0.01
rownames(dat) = df.prot[df.prot$miR.FASP_q.value<0.01,]$Protein.accession

# The protein dataframe is a typical protein expression matrix structure
# Samples are in columns, proteins are in rows
# use unique protein IDs for rownames
# to view the whole data frame, use the command View(dat)

View(dat)
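A small sketch of the two operations used above, column selection with `seq()` and row filtering with a logical vector. The data frame `toy` and its column names (`q.value`, `ch1`, `ch2`) are invented for illustration; the real table has many more columns.

```r
# seq(from, to, by): every second index starting at 15.
TMT_columns <- seq(15, 33, 2)
print(TMT_columns)   # ten column indices, one per TMT channel

# Hypothetical miniature protein table (column names are illustrative).
toy <- data.frame(
  Protein.accession = c("P1", "P2", "P3", "P4"),
  q.value           = c(0.001, 0.20, 0.005, 0.03),
  ch1               = c(1.2, 0.8, 1.0, 1.1),
  ch2               = c(0.9, 1.1, 1.3, 1.0),
  stringsAsFactors  = FALSE
)

keep <- toy$q.value < 0.01               # logical row filter, computed once
toy_dat <- toy[keep, c("ch1", "ch2")]    # rows passing the FDR cut, quant columns only
rownames(toy_dat) <- toy$Protein.accession[keep]
print(toy_dat)
```

Computing the filter once and reusing it for both the data and the row names avoids the two getting out of sync.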

3 Data processing and grouping

3.1 Data processing

dat.log = log2(dat)    # log2-transform the data
dat.log = na.omit(dat.log)    #remove rows with NAs
View(dat.log)
boxplot(dat.log,las=2,main="TMT10plex data PXD004163")    #Use boxplot to check if the samples have medians centered. if not, do median centering.

# Here the data is already median centered, we skip the following step. 
# The sample distributions are already consistent, so normalization can be skipped
# dat.log = equalMedianNormalization(dat.log)
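For reference, median centering itself is simple to sketch in base R: subtract each sample's median from that sample's column. The matrix `m` below is made up, and this is only an illustration of the idea, not necessarily how `equalMedianNormalization` is implemented inside DEqMS.

```r
# Hypothetical log2 matrix: 4 proteins x 3 samples, each sample shifted
# by a different offset.
m <- matrix(c(0.0, 1.0, 2.0, 3.0,
              0.5, 1.5, 2.5, 3.5,
              1.0, 2.0, 3.0, 4.0),
            nrow = 4,
            dimnames = list(paste0("P", 1:4), c("s1", "s2", "s3")))

col_medians <- apply(m, 2, median, na.rm = TRUE)   # one median per sample
centered <- sweep(m, 2, col_medians)               # subtract it from each column

print(col_medians)
print(apply(centered, 2, median))                  # medians after centering
```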

3.2 Sample grouping

# if there is only one factor, such as treatment. You can define a vector with
# the treatment group in the same order as samples in the protein table.

cond = as.factor(c("ctrl","miR191","miR372","miR519","ctrl",
                   "miR372","miR519","ctrl","miR191","miR372"))

# The function model.matrix is used to generate the design matrix
design = model.matrix(~0+cond) # 0 means no intercept for the linear model
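# This builds the design matrix (one column per group); for the underlying principle see: https://treeh.cn/?id=21

colnames(design) = gsub("cond","",colnames(design))    # strip the "cond" prefix from the column names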
View(design)
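A quick way to convince yourself what `~0+cond` does: with no intercept, each column of the design matrix is an indicator for one group, and each fitted coefficient is just that group's mean. The response vector `y` below is made up for the check.

```r
cond <- factor(c("ctrl", "miR191", "miR372", "miR519", "ctrl",
                 "miR372", "miR519", "ctrl", "miR191", "miR372"))

design <- model.matrix(~ 0 + cond)
colnames(design) <- sub("^cond", "", colnames(design))

print(dim(design))       # one row per sample, one column per group
print(colSums(design))   # number of samples in each group

# With no intercept, each coefficient is simply a group mean. Check on a
# made-up response vector y (hypothetical values, one per sample).
y <- c(1, 2, 3, 4, 1.2, 3.1, 4.1, 0.8, 2.2, 2.9)
coef_fit   <- coef(lm(y ~ 0 + cond))
group_mean <- tapply(y, cond, mean)
print(rbind(coef_fit, group_mean))
```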

# you can define one or multiple contrasts here
# specify which pairs of groups to compare
x <- c("miR372-ctrl","miR519-ctrl","miR191-ctrl", "miR372-miR519","miR372-miR191","miR519-miR191")
contrast =  makeContrasts(contrasts=x,levels=design)    # build the contrasts listed in x against the groups defined in design
View(contrast)
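The contrast matrix is easier to read once you build one by hand. In the sketch below, `make_pair_contrast` is a hypothetical helper written for this note (it is not a limma function and only handles simple "A-B" pairs), and the group means are invented numbers.

```r
groups <- c("ctrl", "miR191", "miR372", "miR519")

# Turn "A-B" into a column with +1 for A and -1 for B (simple pairs only).
make_pair_contrast <- function(spec, groups) {
  parts <- strsplit(spec, "-", fixed = TRUE)[[1]]
  v <- setNames(numeric(length(groups)), groups)
  v[parts[1]] <- 1
  v[parts[2]] <- -1
  v
}

x <- c("miR372-ctrl", "miR519-ctrl", "miR191-ctrl")
contrast_by_hand <- sapply(x, make_pair_contrast, groups = groups)
print(contrast_by_hand)

# Applying a contrast to group means gives the log2 fold change.
group_means <- c(ctrl = 0.1, miR191 = 0.3, miR372 = 1.1, miR519 = -0.4)  # made up
log_fc <- drop(group_means %*% contrast_by_hand)
print(log_fc)
```

Each column sums to zero, which is what makes it a comparison between groups rather than an absolute level.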

fit1 <- lmFit(dat.log, design)    # fit a linear model for each protein
fit2 <- contrasts.fit(fit1,contrasts = contrast)    #Compute Contrasts from Linear Model Fit

#Given a linear model fit to microarray data, compute estimated coefficients and standard errors for a given set of contrasts.
fit3 <- eBayes(fit2)#Empirical Bayes Statistics for Differential Expression
#Given a linear model fit from lmFit, compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a global value.
# The descriptions above are taken from help(contrasts.fit) and help(eBayes)
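The "moderation" that eBayes performs can be sketched with the standard limma formula: the posterior variance is a weighted average of a prior variance and each protein's own sample variance, weighted by their degrees of freedom. All numbers below are made up; in a real analysis the prior (`d0`, `s0_sq`) is estimated by eBayes from all proteins.

```r
# Made-up inputs: the prior would normally be estimated from the data.
d0    <- 4      # prior degrees of freedom
s0_sq <- 0.05   # prior variance
d     <- 6      # residual degrees of freedom (10 samples - 4 groups)
s_sq  <- c(P1 = 0.01, P2 = 0.05, P3 = 0.20)   # per-protein sample variances

# Posterior variance: weighted average of prior and sample variance.
s_post <- (d0 * s0_sq + d * s_sq) / (d0 + d)
print(s_post)

# Moderated t for a log2 fold change with unscaled standard error u.
logFC <- c(P1 = 0.5, P2 = 0.5, P3 = 0.5)
u     <- sqrt(1/3 + 1/3)    # contrast of two group means with 3 samples each
t_ordinary  <- logFC / (u * sqrt(s_sq))
t_moderated <- logFC / (u * sqrt(s_post))
print(round(cbind(t_ordinary, t_moderated), 2))
```

Proteins with an unusually small sample variance get a less extreme t after moderation, and proteins with an unusually large one get a more favourable t, which is the stabilizing effect the help text refers to.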

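4 DEqMS analysis

Everything above is plain limma. From here on the tutorial uses the method implemented in DEqMS, which takes the minimum number of PSMs used for quantification within and across experiments and models the relationship between variance and PSM count.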
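Why should variance depend on the PSM count at all? A small simulation gives the intuition. It rests on a simplifying assumption of my own, not taken from the DEqMS paper: a protein's value in each sample is the mean of its PSM-level values, each with independent noise. The function `simulate_protein` and all parameters are hypothetical.

```r
set.seed(1)

# Simulation assumption: a protein's log2 ratio is the mean of its PSM-level
# values, each carrying independent noise with standard deviation 0.5.
simulate_protein <- function(n_psm, n_samples = 10, noise_sd = 0.5) {
  psm_values <- matrix(rnorm(n_psm * n_samples, sd = noise_sd), nrow = n_psm)
  colMeans(psm_values)            # one protein-level value per sample
}

counts <- rep(c(1, 2, 4, 8, 16), each = 200)   # 200 proteins per PSM count
protein_var <- vapply(counts, function(k) var(simulate_protein(k)), numeric(1))

# Median variance per PSM count: it shrinks roughly as 1 / count.
median_var <- tapply(protein_var, counts, median)
print(round(median_var, 4))
```

Proteins quantified from a single PSM are much noisier than those quantified from many, so a single global prior variance fits neither group well.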

# assign an extra variable `count` to the fit3 object, telling how many PSMs are quantified for each protein
library(matrixStats)
count_columns = seq(16,34,2)     # every other column from 16 to 34, i.e. the PSM count columns
psm.count.table = data.frame(count = rowMins(as.matrix(df.prot[,count_columns])), row.names =  df.prot$Protein.accession)

#rowMins: Calculates the minimum for each row (column) of a matrix-like object

fit3$count = psm.count.table[rownames(fit3$coefficients),"count"]    # attach the PSM counts to fit3, matched by protein ID
fit4 = spectraCounteBayes(fit3)
#Peptide/Spectra Count Based Empirical Bayes Statistics for Differential Expression. This function is to calculate peptide/PSM count adjusted t-statistics, p-values.
View(psm.count.table)
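The two steps that are easy to get wrong here are the row-wise minimum and the lookup by row name. A base R sketch on invented counts (the column names `set1` to `set3` are illustrative only):

```r
# Hypothetical PSM count columns (names are illustrative).
psm_counts <- data.frame(
  set1 = c(5, 2, 9, 1),
  set2 = c(3, 4, 9, 1),
  set3 = c(7, 2, 6, 2),
  row.names = c("P1", "P2", "P3", "P4")
)

# Row-wise minimum; a base R stand-in for matrixStats::rowMins.
min_count <- apply(as.matrix(psm_counts), 1, min)
count_table <- data.frame(count = min_count, row.names = rownames(psm_counts))

# Suppose only P3 and P1 survived filtering, in that order.
kept <- c("P3", "P1")
aligned <- count_table[kept, "count"]   # lookup by row name keeps the order
print(aligned)
```

Indexing by row name, rather than by position, is what keeps the counts lined up with the proteins that remain in the fit after q-value filtering and `na.omit`.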

What is new in fit4:

Outputs of spectraCounteBayes:

object is augmented form of “fit” object from eBayes in Limma, with the additions being:

sca.t - Spectra Count Adjusted posterior t-value

sca.p - Spectra Count Adjusted posterior p-value

sca.dfprior - DEqMS estimated prior degrees of freedom

sca.priorvar- DEqMS estimated prior variance

sca.postvar - DEqMS estimated posterior variance

model - fitted model

# n=30 limits the boxplot to show only proteins quantified by <= 30 PSMs.

VarianceBoxplot(fit4,n=30,main="TMT10plex dataset PXD004163",xlab="PSM count")

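# The x axis is the PSM count and the y axis is log(variance).
# In the VarianceBoxplot source the two axes are built from:
#x <- fit$count
#y <- fit$sigma^2
# fit$sigma is the residual standard deviation that lmFit estimates for each protein, so sigma^2 is the per-protein residual variance of the linear model, not a value read off the instrument.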

VarianceScatterplot(fit4,main="TMT10plex dataset PXD004163")
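To see what these plots summarize without running the full pipeline, the same kind of picture can be drawn from two plain vectors. The values of `count` and `sigma` below are made-up stand-ins for `fit$count` and `fit$sigma` (in limma, `sigma` is the residual standard deviation of each row's linear model).

```r
# Made-up stand-ins for fit$count and fit$sigma.
count <- c(1, 1, 1, 2, 2, 2, 4, 4, 4)
sigma <- c(0.60, 0.50, 0.70, 0.40, 0.45, 0.35, 0.25, 0.30, 0.20)

x <- count
y <- log(sigma^2)     # log of the residual variance

median_logvar <- tapply(y, x, median)
print(round(median_logvar, 2))

# One box per PSM count, log variance on the y axis.
boxplot(y ~ x, xlab = "PSM count", ylab = "log(Variance)")
```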

5 Extract the results as a data frame and save

# if you are not sure which coef_col refers to the specific contrast, type
# the line below to inspect the contrasts we defined
head(fit4$coefficients)

# extract the first contrast
DEqMS.results = outputResult(fit4,coef_col = 1)    # miR372-ctrl
# extract the second contrast
#DEqMS.results = outputResult(fit4,coef_col = 2)    # miR519-ctrl
# a quick look on the DEqMS results table
head(DEqMS.results)
# Save it into a tabular text file
write.table(DEqMS.results,"DEqMS.results.miR372-ctrl.txt",sep = "\t", row.names = FALSE,quote=FALSE)

Column descriptions (the key ones to look at are logFC, adj.P.Val and sca.adj.pval):

Explanation of the columns in DEqMS.results:

logFC - log2 fold change between two groups, Here it’s log2(miR372/ctrl).

AveExpr - the mean of the log2 ratios/intensities across all samples. Since input matrix is log2 ratio values, it is the mean log2 ratios of all samples.

t - Limma output t-statistics

P.Value- Limma p-values

adj.P.Val - BH method adjusted Limma p-values

B - Limma B values

count - PSM/peptide count values you assigned

sca.t - DEqMS t-statistics

sca.P.Value - DEqMS p-values

sca.adj.pval - BH method adjusted DEqMS p-values
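A typical next step is to pick out the significant proteins from this table. The sketch below uses an invented table `res` that only borrows the column names, and the cut-offs (adjusted p below 0.05, absolute log2 fold change above 1) are illustrative choices of mine, not something the tutorial prescribes.

```r
# Hypothetical results table with the same column names as DEqMS.results.
res <- data.frame(
  logFC       = c(1.8, -0.2, -1.4, 0.9, 0.1),
  sca.P.Value = c(0.0004, 0.60, 0.003, 0.04, 0.90),
  row.names   = paste0("P", 1:5)
)

# BH adjustment, the same method named in the column descriptions.
res$sca.adj.pval <- p.adjust(res$sca.P.Value, method = "BH")

# Illustrative cut-offs (not prescribed by the tutorial).
sig <- res[res$sca.adj.pval < 0.05 & abs(res$logFC) > 1, ]
print(sig[order(sig$sca.adj.pval), ])
```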

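6 Other analyses

The official vignette also shows how to visualize the differential proteins as heatmaps (ggplot2), and includes a tutorial for label-free data that closely follows the TMT one. It also compares the results of the t-test, limma and ANOVA with those of DEqMS, which suggests that DEqMS is a fairly reliable package.

7 Summary

DEqMS is a fairly simple differential analysis package to pick up. The hard parts are the concepts behind the mass spectrometry runs and some of the mathematics. As clinicians and bioinformaticians we do not need to dig deeply into those right away; the first step is to learn how to apply the package and understand the basic concepts.

At the moment the company delivers a ready-made expression matrix after the run and does not include the PSM matrix, but after asking them it is still possible to obtain the PSM matrix for analysis.

These are personal study notes. If they raise any copyright concerns, please contact me. Thank you.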

References:

[1] Zhu Y, Orre LM, Zhou Tran Y, Mermelekas G, Johansson HJ, Malyutina A, Anders S, Lehtiö J. DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis. Mol Cell Proteomics. 2020 Jun;19(6):1047-1057. doi: 10.1074/mcp.TIR119.001646. Epub 2020 Mar 23. PMID: 32205417; PMCID: PMC7261819.

[2] 徐洪凯, 闫克强, 何燕斌, 闻博, 杨焕明, 刘斯奇. 宏蛋白质组学信息分析的基本策略及其挑战[J]. 生物化学与生物物理进展, 2018, 45(01):23-35. DOI: 10.16476/j.pibb.2017.0187.
