前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >整合单细胞数据和Bulk数据的多种方法(二):R包Scissor

整合单细胞数据和Bulk数据的多种方法(二):R包Scissor

作者头像
生信技能树
发布2023-02-27 20:54:08
2.1K0
发布2023-02-27 20:54:08
举报
文章被收录于专栏:生信技能树生信技能树

上个帖子介绍了整合单细胞数据和Bulk数据的最新方法scAB,整合单细胞数据和Bulk数据的多种方法(一):R包scAB。除了scAB以外,衔接单细胞数据和Bulk测序数据的方法还有Scissor, scPrognosis 和DEGAS 等工具。本系列文章计划依次介绍这些算法,提供给大家更多的选择。本文将继续介绍Scissor算法

Scissor发表在2021年Nature Biotechnology上,题为《Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data》。和scAB算法一样,Scissor算法可利用单细胞数据和bulk RNA-seq数据及表型信息识别与疾病高度相关的细胞亚群。

image-20230125165155292

本文将介绍:

  • Scissor算法的工作流程;
  • 使用R实现方法(github: https://github.com/sunduanchen/Scissor):为了横向比较Scissor和scAB算法,本文采用scAB包的示例数据,最后将Scissor的结果和scAB算法的结果进行对比。

一. Scissor算法的工作流程

Figure_Method

(a)输入数据为

  • scRNA-seq单细胞数据;
  • 其次是对应组织的Bulk RNA-seq数据,以及表型数据。

(b)然后计算每对细胞和bulk样本的Pearson相关系数构建相关系数矩阵 ,Scissor会通过优化样本表型Y与相关矩阵S的回归模型。

(c)由上述优化模型求解的非零系数β用于选择与目标表型相关的细胞亚群。其中Scissor+ 表示所选择的细胞与感兴趣的表型呈正相关,Scissor-为负相关。表型信息可以是连续变量、二分向量或临床生存数据,会分别对应不同的回归方法。

(d and e) 为控制假阳性后续会有一个可靠性显著性检验以及差异表达基因分析、功能富集分析和motif分析等下游分析。

参考来源:生信补给站 跟NBT学Scissor| bulk RNA + scRNA鉴定与目标表型相关的细胞亚群

二. Scissor的代码实现

Scissor R包github在 https://github.com/sunduanchen/Scissor.

安装:

代码语言:javascript
复制
devtools::install_github("jinworks/scAB")
Step1. 加载R包
代码语言:javascript
复制
library(Scissor)
Step2. 加载示例数据

这里采用scAB包的示例数据进行演示:

代码语言:javascript
复制
library(Seurat)
library(preprocessCore)
library(scAB)
#设置可用的内存
# options(future.globals.maxSize = 10 * 1024^3)

#Load data
data("data_survival")
dim(sc_dataset)
#> [1] 21287  8853
sc_dataset
# An object of class Seurat 
# 21287 features across 8853 samples within 1 assay 
# Active assay: RNA (21287 features, 0 variable features)
head(bulk_dataset[,1:10])
head(phenotype)
#                 time status
# TCGA-2Y-A9GS-01  724      1
# TCGA-2Y-A9GT-01 1624      1
# TCGA-2Y-A9GU-01 1939      0
# TCGA-2Y-A9GV-01 2532      1
# TCGA-2Y-A9GW-01 1271      1
# TCGA-2Y-A9GX-01 2442      0
colnames(bulk_dataset) == row.names(phenotype)

sc_dataset <- run_seurat(sc_dataset,verbose = FALSE)
sc_dataset

UMAP_celltype <- DimPlot(sc_dataset, reduction ="umap",
                         group.by="celltype",label = T)
UMAP_celltype

image-20230223174400371

Step3. 运行Scissor

与scAB一样,在获取到了单细胞表达矩阵、Bulk RNA-seq表达矩阵及其临床信息之后,运行Scissor:

代码语言:javascript
复制
infos1 <- Scissor(bulk_dataset, sc_dataset, phenotype, alpha = 0.05, 
                  family = "cox", Save_file = './Scissor_survival.RData')

image-20230223174116647

这里我在Linux服务器上遇到了这个报错,而在Windows下运行成功,没搞明白。。所以我检查了源代码,修复了这个bug,只要运行下面的函数,再跑一次就行了:

代码语言:javascript
复制
Scissor <- function (bulk_dataset, sc_dataset, phenotype, tag = NULL, alpha = NULL, 
                     cutoff = 0.2, family = c("gaussian", "binomial", "cox"), 
                     Save_file = "Scissor_inputs.RData", Load_file = NULL) 
{
  library(Seurat)
  library(Matrix)
  library(preprocessCore)
  if (is.null(Load_file)) {
    common <- intersect(rownames(bulk_dataset), rownames(sc_dataset))
    if (length(common) == 0) {
      stop("There is no common genes between the given single-cell and bulk samples.")
    }
    if (class(sc_dataset) == "Seurat") {
      sc_exprs <- as.matrix(sc_dataset@assays$RNA@data)
      network <- as.matrix(sc_dataset@graphs$RNA_snn)
    }
    else {
      sc_exprs <- as.matrix(sc_dataset)
      Seurat_tmp <- CreateSeuratObject(sc_dataset)
      Seurat_tmp <- FindVariableFeatures(Seurat_tmp, selection.method = "vst", 
                                         verbose = F)
      Seurat_tmp <- ScaleData(Seurat_tmp, verbose = F)
      Seurat_tmp <- RunPCA(Seurat_tmp, features = VariableFeatures(Seurat_tmp), 
                           verbose = F)
      Seurat_tmp <- FindNeighbors(Seurat_tmp, dims = 1:10, 
                                  verbose = F)
      network <- as.matrix(Seurat_tmp@graphs$RNA_snn)
    }
    diag(network) <- 0
    network[which(network != 0)] <- 1
    dataset0 <- cbind(bulk_dataset[common, ], sc_exprs[common, 
    ])
    dataset1 <- normalize.quantiles(as.matrix(dataset0))
    rownames(dataset1) <- rownames(dataset0)
    colnames(dataset1) <- colnames(dataset0)
    Expression_bulk <- dataset1[, 1:ncol(bulk_dataset)]
    Expression_cell <- dataset1[, (ncol(bulk_dataset) + 
                                     1):ncol(dataset1)]
    X <- cor(Expression_bulk, Expression_cell)
    quality_check <- quantile(X)
    print("|**************************************************|")
    print("Performing quality-check for the correlations")
    print("The five-number summary of correlations:")
    print(quality_check)
    print("|**************************************************|")
    if (quality_check[3] < 0.01) {
      warning("The median correlation between the single-cell and bulk samples is relatively low.")
    }
    if (family == "binomial") {
      Y <- as.numeric(phenotype)
      z <- table(Y)
      if (length(z) != length(tag)) {
        stop("The length differs between tags and phenotypes. Please check Scissor inputs and selected regression type.")
      }
      else {
        print(sprintf("Current phenotype contains %d %s and %d %s samples.", 
                      z[1], tag[1], z[2], tag[2]))
        print("Perform logistic regression on the given phenotypes:")
      }
    }
    if (family == "gaussian") {
      Y <- as.numeric(phenotype)
      z <- table(Y)
      if (length(z) != length(tag)) {
        stop("The length differs between tags and phenotypes. Please check Scissor inputs and selected regression type.")
      }
      else {
        tmp <- paste(z, tag)
        print(paste0("Current phenotype contains ", 
                     paste(tmp[1:(length(z) - 1)], collapse = ", "), 
                     ", and ", tmp[length(z)], " samples."))
        print("Perform linear regression on the given phenotypes:")
      }
    }
    if (family == "cox") {
      Y <- as.matrix(phenotype)
      if (ncol(Y) != 2) {
        stop("The size of survival data is wrong. Please check Scissor inputs and selected regression type.")
      }
      else {
        print("Perform cox regression on the given clinical outcomes:")
      }
    }
    save(X, Y, network, Expression_bulk, Expression_cell, 
         file = Save_file)
  }
  else {
    load(Load_file)
  }
  if (is.null(alpha)) {
    alpha <- c(0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 
               0.6, 0.7, 0.8, 0.9)
  }
  for (i in 1:length(alpha)) {
    set.seed(123)
    fit0 <- APML1(X, Y, family = family, penalty = "Net", 
                  alpha = alpha[i], Omega = network, nlambda = 100, 
                  nfolds = min(10, nrow(X)))
    fit1 <- APML1(X, Y, family = family, penalty = "Net", 
                  alpha = alpha[i], Omega = network, lambda = fit0$lambda.min)
    if (family == "binomial") {
      Coefs <- as.numeric(fit1$Beta[2:(ncol(X) + 1)])
    }
    else {
      Coefs <- as.numeric(fit1$Beta)
    }
    Cell1 <- colnames(X)[which(Coefs > 0)]
    Cell2 <- colnames(X)[which(Coefs < 0)]
    percentage <- (length(Cell1) + length(Cell2))/ncol(X)
    print(sprintf("alpha = %s", alpha[i]))
    print(sprintf("Scissor identified %d Scissor+ cells and %d Scissor- cells.", 
                  length(Cell1), length(Cell2)))
    print(sprintf("The percentage of selected cell is: %s%%", 
                  formatC(percentage * 100, format = "f", digits = 3)))
    if (percentage < cutoff) {
      break
    }
    cat("\n")
  }
  print("|**************************************************|")
  return(list(para = list(alpha = alpha[i], lambda = fit0$lambda.min, 
                          family = family), Coefs = Coefs, Scissor_pos = Cell1, 
              Scissor_neg = Cell2))
}

再次运行:

代码语言:javascript
复制
infos1 <- Scissor(bulk_dataset, sc_dataset, phenotype, alpha = 0.05, 
                  family = "cox", Save_file = './Scissor_survival.RData')

names(infos1)
# [1] "para"        "Coefs"       "Scissor_pos" "Scissor_neg"
length(infos1$Scissor_pos)
# [1] 429
infos1$Scissor_pos[1:4]
length(infos1$Scissor_neg)
# 1001
infos1$Scissor_neg

运行大概三五分钟,在本地文件夹下即可生成结果Rdata,检查结果可见共识别429个Scissor+细胞(即与表型呈正相关的细胞)和1001个Scissor阴性细胞(即与表型呈负相关的细胞)。这里要强调一下Scissor R包的运行速度是真的快。而scAB运行这个8000个细胞的示例数据,需要花费好几个小时,效率上相差甚远。

Step4. 可视化结果
代码语言:javascript
复制
Scissor_select <- rep(0, ncol(sc_dataset))
names(Scissor_select) <- colnames(sc_dataset)
Scissor_select[infos1$Scissor_pos] <- "Scissor+"
Scissor_select[infos1$Scissor_neg] <- "Scissor-"
sc_dataset <- AddMetaData(sc_dataset, metadata = Scissor_select, col.name = "scissor")
UMAP_scissor <- DimPlot(sc_dataset, reduction = 'umap', 
        group.by = 'scissor',
        cols = c('grey','royalblue','indianred1'), 
        pt.size = 0.001, order = c("Scissor+","Scissor-"))
UMAP_scissor
patchwork::wrap_plots(plots = list(UMAP_celltype,UMAP_scissor), ncol = 2)

image-20230223180526286

检查结果可知Scissor+ 细胞主要为肿瘤细胞,Scissor阴性细胞主要为T细胞和CAFs:

代码语言:javascript
复制
table(sc_dataset$scissor,sc_dataset$celltype)
#           B cell  CAF HPC-like Malignant cell T cell  TAM  TEC
# 0           907  718      728           1066   2405  652  947
# Scissor-     85  225       78            101    358   52  102
# Scissor+      1    4       35            337     29   20    3

library(gplots)
balloonplot(table(sc_dataset$scissor,sc_dataset$celltype))

image-20230223180753824

Step5. 对比scAB的结果

加载上期scAB的结果:

代码语言:javascript
复制
scAB.res = read_rds("scAB.res.rds")
table(row.names(sc_dataset@meta.data) == row.names(scAB.res@meta.data))
#TRUE 
#8853 
sc_dataset$scAB = scAB.res$scAB_select

可视化:

代码语言:javascript
复制
UMAP_scAB <- DimPlot(sc_dataset,group.by="scAB",
                     cols=c("#80b1d3","red"),
                     pt.size=0.001,
                     order=c("scAB+ cells","Other cells"))

patchwork::wrap_plots(plots = list(UMAP_celltype,UMAP_scissor,UMAP_scAB), ncol = 3)

image-20230223180558402

取交集:

代码语言:javascript
复制
balloonplot(table(sc_dataset$scissor,sc_dataset$scAB))

image-20230223180919836

只有143个细胞是scAB+ Scissor+细胞。个人感觉结果相差还是有点大的。

总结来说:

  • scAB运行较慢(8853个细胞需2小时左右),但是似乎会得到更多的表型相关阳性细胞(915/8853 vs. 429/8853),另外scAB可以整合scATAC-seq数据
  • Scissor运行很快(8853个细胞需5分钟左右),但是似乎会得到更少的表型相关阳性细胞,另外Scissor还提供了表型相关阴性细胞。
本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2023-02-26,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信技能树 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一. Scissor算法的工作流程
  • 二. Scissor的代码实现
    • Step1. 加载R包
      • Step2. 加载示例数据
        • Step3. 运行Scissor
          • Step4. 可视化结果
            • Step5. 对比scAB的结果
            领券
            问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档