最近看到了一个研究,使用ChIP-Seq技术检测了转录因子SATB2在结肠上皮细胞中全基因组的结合位点,发现92.3%(39% intergenic regions和53.2% introns)的结合位点位于非启动子区域,我看了看,作者使用的就是经典的软件《HOMER》,定义启动子区域也是中规中矩:A promoter region was defined as a region within ± 2 Kb from the TSS.
全部的ChIP-seq数据流程在文章描述的很清晰:
ChIP-seq数据流程
也可以看我们《生信技能树》的B站免费NGS数据处理视频课程:
通常情况下,我们认为转录因子在某个基因的启动子区域结合是调控关系,靶基因。但是这个SATB2居然绝大部分的结合位点都不是各个基因的启动子区域,就很尴尬了。不过,研究者综合motif 富集分析发现了肠道关键转录因子CDX2,HNF4A和转录调控过程中发挥重要作用的转录激活因子P300,提示SATB2可能与CDX2和HNF4A共同参与了增强子激活靶基因的转录调控。
主要是体现在:Figure 6. SATB2 regulates enhancer dynamics and binding of intestinal TFs CDX2 and HNF4A (A–C) Extensive genomic co-binding of SATB2 with CDX2 and HNF4A in colonic epithelium.
其中图 (A) Top TF binding motifs enriched in SATB2 ChIP-seq sites (MACS p < 1 3 109) by HOMER in control colonic epithelium. Motifs were ranked by log10 (p value).
Top TF binding motifs
最开始仅仅是SATB2这个转录因子作为目标基因,然后ChIP-Seq技术看到了它的关联转录因子,就再做两个转录因子的ChIP-Seq数据,接下来就有3个数据啦,可以取交集:
所以图 (B) Venn diagram showing the overlaps among SATB2-, CDX2-, and HNF4A-bound regions. Shown are two biological replicates for each of the factors. 如下所示:
多个转录因子的peaks文件取交集
这个文章发表于2021年9月27日,美国康奈尔医学院周乔课题组在***Cell Stem Cell*** 期刊,文章标题是:《SATB2 preserves colon stem cell identity and mediates ileum-colon conversion via enhancer remodeling》,在线阅读链接 是:https://doi.org/10.1016/j.stem.2021.09.004
其中的ChIP-Seq: GSE167287,此时此刻(2021-09-28)并不公开!
写一个简单的脚本,文件名是 anno.R ,内容如下所示:
t1<-Sys.time()
temp_args <- commandArgs(trailingOnly = T)
if(length(temp_args) != 3){
cat("运行命令方式:Rscript test.R folder sample species \n")
quit("no")
}else{
bedPeaksFile <- temp_args[1]
species <-temp_args[2]
folder <-temp_args[3]
#BiocManager::install(c('ChIPseeker','TxDb.Hsapiens.UCSC.hg38.knownGene','org.Hs.eg.db'))
# BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
# BiocManager::install("ChIPseeker")
require(ChIPseeker)
require(TxDb.Hsapiens.UCSC.hg38.knownGene)
require(TxDb.Mmusculus.UCSC.mm10.knownGene)
require(org.Hs.eg.db)
require(org.Mm.eg.db)
bedPeaksFile
cat(paste0('Now we process ', bedPeaksFile ))
peak <- readPeakFile( bedPeaksFile )
# table(seqlevels(peak))
# keepChr= !grepl('_',seqlevels(peak))
# seqlevels(peak, pruning.mode="coarse") <- seqlevels(peak)[keepChr]
# table(seqlevels(peak))
cat(paste0('there are ',length(peak),' peaks for this data' ))
if(species=='human'){
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Hs.eg.db")
}else{
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Mm.eg.db")
}
df=as.data.frame(peakAnno)
cg_df = df[,c(1,2,4,23)]
head(cg_df)
cl=ifelse(grepl('Promoter',df$annotation),'Promoter',
ifelse(grepl('Intron',df$annotation),'Intron',
ifelse(grepl('Intergenic',df$annotation),'Intergenic',
ifelse(grepl('Exon',df$annotation),'Exon',
'other'))))
table(cl)
cg_df$anno=cl
plotAnnoPie(peakAnno)
write.table(cg_df,sep = '\t',quote = F,row.names = F,col.names = F,
file = file.path(folder,
gsub('sort_peaks.narrowPeak.bed','ChIPseeker_anno.txt',
basename(bedPeaksFile))))
print("all done!!!")
t2<-Sys.time()
t2
df <- t1-t2
print(df)
}
就可以在命令行运行:
Rscript anno.R sort_peaks.narrowPeak.bed human tf_human/
假如你有成百上千个bed文件,也可以使用这个格式的命令行,批量提交。
非常方便!