# 使用GSVA方法计算某基因集在各个样本的表现

### 算法细节

GSEA 算法 GSEA分析一文就够（单机版+R语言版） GSEA的统计学原理试讲

GSVA starts by evaluating whether a gene i is highly or lowly expressed in sample j in the context of the sample population distribution.

For each gene expression profile xi={xi1,…,xin}, a non-parametric kernel estimation of its cumulative density function is performed.

We offer two approaches for turning the KS like random walk statistic into an enrichment statistic (ES) (also called GSVA score), the classical maximum deviation method and a normalized ES.

### 安装GSVA这个R包

```## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
biocLite("GSVA")
library(GSVA)
browseVignettes("GSVA")
browseVignettes("estimate")```

• method="plage" (Tomfohr et al., 2005). Pathway level analysis of gene expression (PLAGE)
• method="zscore" (Lee et al., 2008). The combined z-score method a
• method="ssgsea" (Barbie et al., 2009). Single sample GSEA (ssGSEA) calculates a gene set enrichment score per sample

### 先在模拟数据应用GSVA

```library(GSVA)

p <- 20000    ## number of genes
n <- 30       ## number of samples
nGS <- 100    ## number of gene sets
min.sz <- 10  ## minimum gene set size
max.sz <- 100 ## maximum gene set size
X <- matrix(rnorm(p*n), nrow=p, dimnames=list(1:p, 1:n))
dim(X)
gs <- as.list(sample(min.sz:max.sz, size=nGS, replace=TRUE)) ## sample gene set sizes
gs <- lapply(gs, function(n, p) sample(1:p, size=n, replace=FALSE), p) ## sample gene sets
es.max <- gsva(X, gs, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)
es.dif <- gsva(X, gs, mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
pheatmap::pheatmap(es.max)
pheatmap::pheatmap(es.dif)```

#### 两个模型的区别

```par(mfrow=c(1,2), mar=c(4, 4, 4, 1))
plot(density(as.vector(es.max)), main="Maximum deviation from zero",
xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
plot(density(as.vector(es.dif)), main="Difference between largest\npositive and negative deviations",
xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
# 前者是高斯分布，后者是二项式分布```

### 在真实数据(白血病芯片数据)上面测试GSVA算法

```library(GSEABase)

library(Biobase)
library(genefilter)
library(limma)
library(RColorBrewer)
library(GSVA)

cacheDir <- system.file("extdata", package="GSVA")
cachePrefix <- "cache4vignette_"
file.remove(paste(cacheDir,
list.files(cacheDir, pattern=cachePrefix), sep="/"))

data(leukemia)
leukemia_eset
table(leukemia_eset\$subtype)```

```filtered_eset <- nsFilter(leukemia_eset, require.entrez=TRUE,
remove.dupEntrez=TRUE,
var.func=IQR, var.filter=TRUE,
var.cutoff=0.5, filterByQuantile=TRUE,
feature.exclude="^AFFX")
filtered_eset
leukemia_filtered_eset <- filtered_eset\$eset
leukemia_filtered_eset
exprs(leukemia_filtered_eset)[1:4,1:4]```

#### 根据表型数据使用limma包来找到有显著差异的基因集

```cache(leukemia_es <- gsva(leukemia_filtered_eset, c2BroadSets,
min.sz=10, max.sz=500, verbose=TRUE),
dir=cacheDir, prefix=cachePrefix)

logFCcutoff <- log2(2)
design <- model.matrix(~ factor(leukemia_es\$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_es, design)
fit <- eBayes(fit)
allGeneSets <- topTable(fit, coef="MLLvsALL", number=Inf)
DEgeneSets <- topTable(fit, coef="MLLvsALL", number=Inf,
summary(res)```

Thus, there are 38 MSigDB C2 curated pathways that are differentially activated between MLL and ALL at 0.1% FDR.

#### 同样limma当然是可以找显著差异表达的基因的

```logFCcutoff <- log2(2)
design <- model.matrix(~ factor(leukemia_eset\$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_filtered_eset, design)
fit <- eBayes(fit)
allGenes <- topTable(fit, coef="MLLvsALL", number=Inf)
DEgenes <- topTable(fit, coef="MLLvsALL", number=Inf,
summary(res)```

Here, 122 genes show up as being differentially expressed with a minimum fold-change of 2 at 0.1% FDR.

### 还可以对包内置的TCGA数据进行测试

```data(gbm_VerhaakEtAl)
gbm_eset
exprs(gbm_eset)[1:4,1:4]
### 如下
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11861 features, 173 samples
element names: exprs
protocolData: none
phenoData
rowNames: TCGA.02.0003.01A.01 TCGA.02.0010.01A.01
... TCGA.12.0620.01A.01 (173 total)
varLabels: subtype
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
> ```

```data(gbm_VerhaakEtAl)
gbm_eset
table(gbm_eset\$subtype)
data(brainTxDbSets)
sapply(brainTxDbSets, length)
## 可以看到，都是symbol格式的基因ID
gbm_es <- gsva(gbm_eset, brainTxDbSets, mx.diff=FALSE, verbose=FALSE, parallel.sz=1)```

### 不同算法在转录组测序数据的表现

```canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),

```data(genderGenesEntrez)

MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
MSY
XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
XiE

`myC7 <- getGmt("c7.all.v5.1.entrez.gmt", geneIdType=EntrezIdentifier(), collectionType=BroadCollection(category="c7"), sep="\t")`

```library(GSEABase)
### 假设之前是其它ID
library(hgu95av2.db)

554 篇文章254 人订阅

0 条评论

## 相关文章

39620

38950

### SSE图像算法优化系列二：高斯模糊算法的全面优化过程分享（二）。

相关链接： 高斯模糊算法的全面优化过程分享（一）      在高斯模糊算法的全面优化过程分享（一）一文中我们已经给出了一种相当高性能的高斯模糊过程，...

45360

52060

17150

53060

### 【学术】不懂神经网络？不怕，一文教你用JavaScript构建神经网络

AiTechYun 编辑：xiaoshan.xiang 本文的内容并不是关于神经网络的深度教程，在这里既不会深入研究输入层、激活函数的内部原理，也不会教你如何使...

35240

25520

### ToppGene Suite中文使用指南

2007.12：Improved human disease candidate gene prioritization using mouse phenoty...

20230

53760