ChIP-seq数据应该是看peaks呢还是看motif

生信技能树

发布于 2021-10-12 12:06:17

2.3K0

发布于 2021-10-12 12:06:17

文章被收录于专栏：生信技能树

最近看到了一个研究，使用ChIP-Seq技术检测了转录因子SATB2在结肠上皮细胞中全基因组的结合位点，发现92.3%（39% intergenic regions和53.2% introns）的结合位点位于非启动子区域，我看了看，作者使用的就是经典的软件《HOMER》，定义启动子区域也是中规中矩：A promoter region was defined as a region within ± 2 Kb from the TSS.

全部的ChIP-seq数据流程在文章描述的很清晰：

ChIP-seq数据流程

也可以看我们《生信技能树》的B站免费NGS数据处理视频课程：

通常情况下，我们认为转录因子在某个基因的启动子区域结合是调控关系，靶基因。但是这个SATB2居然绝大部分的结合位点都不是各个基因的启动子区域，就很尴尬了。不过，研究者综合motif 富集分析发现了肠道关键转录因子CDX2，HNF4A和转录调控过程中发挥重要作用的转录激活因子P300，提示SATB2可能与CDX2和HNF4A共同参与了增强子激活靶基因的转录调控。

主要是体现在：Figure 6. SATB2 regulates enhancer dynamics and binding of intestinal TFs CDX2 and HNF4A (A–C) Extensive genomic co-binding of SATB2 with CDX2 and HNF4A in colonic epithelium.

其中图 (A) Top TF binding motifs enriched in SATB2 ChIP-seq sites (MACS p < 1 3 109) by HOMER in control colonic epithelium. Motifs were ranked by log10 (p value).

Top TF binding motifs

最开始仅仅是SATB2这个转录因子作为目标基因，然后ChIP-Seq技术看到了它的关联转录因子，就再做两个转录因子的ChIP-Seq数据，接下来就有3个数据啦，可以取交集：

所以图 (B) Venn diagram showing the overlaps among SATB2-, CDX2-, and HNF4A-bound regions. Shown are two biological replicates for each of the factors. 如下所示：

多个转录因子的peaks文件取交集

这个文章发表于2021年9月27日，美国康奈尔医学院周乔课题组在***Cell Stem Cell*** 期刊，文章标题是：《SATB2 preserves colon stem cell identity and mediates ileum-colon conversion via enhancer remodeling》，在线阅读链接是：https://doi.org/10.1016/j.stem.2021.09.004

其中的ChIP-Seq: GSE167287，此时此刻(2021-09-28)并不公开！

通常我们拿到了bed格式的peaks文件

写一个简单的脚本，文件名是 anno.R ，内容如下所示：

t1<-Sys.time()
temp_args <- commandArgs(trailingOnly = T)

if(length(temp_args) != 3){
  cat("运行命令方式：Rscript test.R folder sample species \n")
  quit("no")
}else{
  bedPeaksFile <- temp_args[1]
  species <-temp_args[2]
  folder <-temp_args[3]
  #BiocManager::install(c('ChIPseeker','TxDb.Hsapiens.UCSC.hg38.knownGene','org.Hs.eg.db'))
  # BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
  # BiocManager::install("ChIPseeker")
  require(ChIPseeker)
  require(TxDb.Hsapiens.UCSC.hg38.knownGene)
  require(TxDb.Mmusculus.UCSC.mm10.knownGene) 
  require(org.Hs.eg.db) 
  require(org.Mm.eg.db) 
  
  bedPeaksFile 
  cat(paste0('Now we process ', bedPeaksFile  ))
  peak <- readPeakFile( bedPeaksFile )  
  # table(seqlevels(peak))
  # keepChr= !grepl('_',seqlevels(peak))
  # seqlevels(peak, pruning.mode="coarse") <- seqlevels(peak)[keepChr]
  # table(seqlevels(peak))
  cat(paste0('there are ',length(peak),' peaks for this data' ))
  
  if(species=='human'){
    txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene 
    peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000), 
                             TxDb=txdb, annoDb="org.Hs.eg.db") 
  }else{
    txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene 
    peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000), 
                             TxDb=txdb, annoDb="org.Mm.eg.db") 
  }
  
  df=as.data.frame(peakAnno)
  cg_df = df[,c(1,2,4,23)]
  head(cg_df)
  cl=ifelse(grepl('Promoter',df$annotation),'Promoter',
            ifelse(grepl('Intron',df$annotation),'Intron',
                   ifelse(grepl('Intergenic',df$annotation),'Intergenic',
                          ifelse(grepl('Exon',df$annotation),'Exon',
                                 'other'))))
  table(cl)
  cg_df$anno=cl
  plotAnnoPie(peakAnno)
  write.table(cg_df,sep = '\t',quote = F,row.names = F,col.names = F,
              file = file.path(folder,
                               gsub('sort_peaks.narrowPeak.bed','ChIPseeker_anno.txt',
                                    basename(bedPeaksFile))))
  
  print("all done!!!")
  t2<-Sys.time()
  t2
  df <- t1-t2
  print(df)
}

就可以在命令行运行：

 Rscript anno.R  sort_peaks.narrowPeak.bed  human tf_human/

假如你有成百上千个bed文件，也可以使用这个格式的命令行，批量提交。

非常方便！

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-09-28，如有侵权请联系 cloudcommunity@tencent.com 删除

tcp/ip

node.js

本文分享自生信技能树微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

tcp/ip

node.js

登录后参与评论

0 条评论

热度