专栏首页医学生物信息学R语言CMplot包绘制曼哈顿图

R语言CMplot包绘制曼哈顿图

曼哈顿图本质上是一个散点图,用于显示大量非零大范围波动数值,最早应用于全基因组关联分析(GWAS)研究展示高度相关位点。它得名源于样式与曼哈顿天际线相似。

近几年,在宏基因组领域,尤其是差异OTU结合分类学结果,采用Manhattan plot展示有非常好的效果,倍受推崇。

曼哈顿图优点

大数据中,即展示数据全貌,又能快速找到目标基因或OTU,同时可知目标的具体位置和分类、显著程度等信息。绝对高端大气,而且还有内涵。

数据坐标轴介绍

以GWAS研究结果为例:

- X轴为染色体编号,且每个基因组SNP位点沿染色体序列排列;在16S扩增子或宏基因组中则为OTU按Taxonomy某一级别排序。

- Y轴为该位点相关的统计显著性P-value值,由于p-value值范围是从0-1,且越小越好,直接展示非常密集于0附近,很难区分。如何使越近0的显著数值变大,且而容易区分开,log10变换是非常好的方法,直接把关注的高显著性(Pvalue趋近零)值高位显示,远离整体,目标一目了然。

- 图中水平线一般为设定的不同显著性水平阈值,方便读出每个点的显著性水平;或只添加一条显示性阈值,高于则显著。

曼哈顿图绘制工具

散点图,自然还是R语言,ggplot2可以画的非常漂亮。这里我们介绍CMplot包绘制曼哈顿图。

1.安装并加载所需R包

 > # CMplot在CRAN上可用,因此可以使用以下R代码安装它
 > install.packages("CMplot") # 安装包,如果已经安装,此行可忽略。
 > library(CMplot)

CMplot中有两个示例数据集,用户可以通过以下R代码导出和查看详细信息:

 > data(pig60K)   #calculated p-values by MLM
 > data(cattle50K)   #calculated SNP effects by rrblup
 > head(pig60K)
 
           SNP Chromosome Position    trait1     trait2     trait3
 1 ALGA0000009          1    52297 0.7738187 0.51194318 0.51194318
 2 ALGA0000014          1    79763 0.7738187 0.51194318 0.51194318
 3 ALGA0000021          1   209568 0.7583016 0.98405289 0.98405289
 4 ALGA0000022          1   292758 0.7200305 0.48887140 0.48887140
 5 ALGA0000046          1   747831 0.9736840 0.22096836 0.22096836
 6 ALGA0000047          1   761957 0.9174565 0.05753712 0.05753712
 
 > head(cattle50K)
 
    SNP chr    pos Somatic cell score  Milk yield Fat percentage
 1 SNP1   1  59082        0.000244361 0.000484255    0.001379210
 2 SNP2   1 118164        0.000532272 0.000039800    0.000598951
 3 SNP3   1 177246        0.001633058 0.000311645    0.000279427
 4 SNP4   1 236328        0.001412865 0.000909370    0.001040161
 5 SNP5   1 295410        0.000090700 0.002202973    0.000351394
 6 SNP6   1 354493        0.000110681 0.000342628    0.000105792

作为示例数据集,前三列分别为SNPs的名称、染色体、位置,列的res为GWAS的p-value或GS/GP对性状的影响,traits的数量是无限的。注意:如果绘制SNP_Density,只需要前三列。

简单来说,前三列分别为SNP的名称,所在染色体,SNP的位置,后面每列为不同性状的P值,每个性状单独一列

CMplot不仅可以处理全基因组的关联研究结果,还可以处理SNP效应、Fst、tajima’s D等。

CMplot共有40个参数,输入?CMplot可以得到所有参数的详细功能。

常用参数解释如下:

 Pmap    输入数据文件
 col 设置不同染色体中点的颜色
 cex 设置点的大小
 pch 设置点的形状
 band    设置不同染色体之间的间隔
 H   设置每个圈的高度
 ylim    设置y轴的范围
 bin.size    设置SNP密度图中的窗口大小
 cex.axis    设置坐标轴字体和标签字体的大小
 plot.type   设置不同的绘图类型,可以设定为 "d", "c", "m", "q" or "b"
 multracks   设置是否需要绘制多个track
 r   设置圈的半径大小
 xlab    设置x轴标签
 ylab    设置y轴标签
 outward 设置点的朝向是否向外
 threshold   设置阈值并添加阈值线
 threshold.col   设置阈值线的颜色
 threshold.lwd   设置阈值线的宽度
 threshold.lty   设置阈值线的类型
 amplify 设置是否放大显著的点
 signal.cex  设置显著点的大小
 signal.pch  设置显著点的形状
 signal.col  设置显著点的颜色
 chr.labels  设置染色体的标签
 chr.den.col 设置SNP密度图的颜色
 cir.band    设置环状曼哈度图中不同染色体之间的间隔
 cir.chr 设置是否显示染色体的边界
 cir.chr.h   设置染色体边界的高度
 cir.legend  设置是否显示图例
 cir.legend.cex  设置图例字体的大小
 cir.legend.col  设置图例的颜色
 LOG10   设置是否对p-value取log10对数
 conf.int.col    设置QQ图中置信区间的颜色
 file.output 设置是否输出图片
 file    设置输出图片的格式,可以设定为"jpg", "pdf", "tiff"
 dpi 设置输出图片的分辨度
 memo    设置输出图片文件的名字

2.默认绘图(分别绘制出SNP密度图,曼哈顿图,环形曼哈顿图和QQ图)

2.1. SNP密度图

 > CMplot(pig60K,plot.type="d",bin.size=1e6,col=c("darkgreen", "yellow", "red"),file="jpg",memo="",dpi=300,
     file.output=TRUE, verbose=TRUE)
 # users can personally set the windowsize and the max of legend by:
 # bin.size=1e6
 # bin.max=N
 # memo: add a character to the output file name.

2.2. 环形曼哈顿图

(1).全基因组关联研究(Genome-wide association study(GWAS))

 >CMplot(pig60K,plot.type="c",chr.labels=paste("Chr",c(1:18,"X"),sep=""),r=0.4,cir.legend=TRUE,outward=FALSE,cir.legend.col="black",cir.chr.h=1.3,chr.den.col="black",file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)
 >CMplot(pig60K,plot.type="c",r=0.4,col=c("grey30","grey60"),chr.labels=paste("Chr",c(1:18,"X"),sep=""),threshold=c(1e-6,1e-4),cir.chr.h=1.5,amplify=TRUE,threshold.lty=c(1,2),threshold.col=c("red","blue"),signal.line=1,signal.col=c("red","green"),chr.den.col=c("darkgreen","yellow","red"),
 bin.size=1e6,outward=FALSE,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)
 #Note:
 1. if signal.line=NULL, the lines that crosse circles won't be added.
 2. if the length of parameter 'chr.den.col' is not equal to 1, SNP density that counts 
    the number of SNP within given size('bin.size') will be plotted around the circle.

(2). 基因组选择/预测(Genomic Selection/Prediction(GS/GP))

 >CMplot(cattle50K,plot.type="c",LOG10=FALSE,outward=TRUE,col=matrix(c("#4DAF4A",NA,NA,"dodgerblue4","deepskyblue",NA,"dodgerblue1", "olivedrab3", "darkgoldenrod1"), nrow=3, byrow=TRUE),chr.labels=paste("Chr",c(1:29),sep=""),threshold=NULL,r=1.2,cir.chr.h=1.5,cir.legend.cex=0.5,cir.band=1,file="jpg",memo="",dpi=300,chr.den.col="black",file.output=TRUE,verbose=TRUE)
         
 #Note: 
 Parameter 'col' can be either vector or matrix, if a matrix, each trait can be plotted in different colors.

2.3. Single_track Rectangular-Manhattan plot

(1) Genome-wide association study(GWAS)

 >CMplot(pig60K,plot.type="m",LOG10=TRUE,threshold=NULL,chr.den.col=NULL,file="jpg",memo="",dpi=300, ,file.output=TRUE,verbose=TRUE)
 > CMplot(pig60K, plot.type="m", col=c("grey30","grey60"), LOG10=TRUE, ylim=c(2,12), threshold=c(1e-6,1e-4),threshold.lty=c(1,2), threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,chr.den.col=NULL, signal.col=c("red","green"),signal.cex=c(1,1),signal.pch=c(19,19),file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)
 
 #Note:
 if the ylim is setted, then CMplot will only plot the ponits which among this interval.
 > CMplot(pig60K, plot.type="m", LOG10=TRUE, ylim=NULL, threshold=c(1e-6,1e-4),threshold.lty=c(1,2),threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,bin.size=1e6,chr.den.col=c("darkgreen", "yellow","red"),signal.col=c("red","green"),signal.cex=c(1,1),signal.pch=c(19,19),file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)
         
 #Note:
 if the length of parameter 'chr.den.col' is bigger than 1, SNP density that counts 
    the number of SNP within given size('bin.size') will be plotted.

(2) Genomic Selection/Prediction(GS/GP)

 > CMplot(cattle50K, plot.type="m", band=0.5, LOG10=FALSE, ylab="SNP effect",threshold=c(-0.015, 0.015),threshold.lty=2, threshold.lwd=1, threshold.col="red", amplify=FALSE, chr.den.col=NULL, file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE,cex=0.8)
 
 #Note: 
 if signal.col=NULL, the significant SNPs will be plotted with original colors.
 > cattle50K[,4:ncol(cattle50K)] <- apply(cattle50K[,4:ncol(cattle50K)], 2, function(x) x*sample(c(1,-1), length(x), rep=TRUE))
 > CMplot(cattle50K, plot.type="m", band=0, LOG10=FALSE, ylab="Abs(SNP effect)",threshold=0.015,threshold.lty=2, threshold.lwd=1, threshold.col="red", amplify=TRUE, signal.col=NULL,chr.den.col=NULL, file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)
 
 #Note: Positive and negative values are acceptable.

2.4. Multi_tracks Rectangular-Manhattan plot

 > CMplot(pig60K, plot.type="m", multracks=TRUE, threshold=c(1e-6,1e-4),threshold.lty=c(1,2), threshold.lwd=c(1,1), threshold.col=c("black","grey"), amplify=TRUE,bin.size=1e6, chr.den.col=c("darkgreen", "yellow", "red"), signal.col=c("red","green"),signal.cex=c(1,1),file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)

a. all traits in separated axes:

b. all traits in a axes:

2.4. Q-Q plot

2.4.1. Single_track Q-Q plot

 > CMplot(pig60K,plot.type="q",conf.int.col=NULL,box=TRUE,file="jpg",memo="",dpi=300,
     ,file.output=TRUE,verbose=TRUE)

2.4.2. Multi_tracks Q-Q plot

 > CMplot(pig60K,plot.type="q",col=c("dodgerblue1", "olivedrab3", "darkgoldenrod1"),threshold=1e6,signal.pch=19,signal.cex=1.5,signal.col="red",conf.int.col="grey",box=FALSE,multracks=TRUE,file="jpg",memo="",dpi=300,file.output=TRUE,verbose=TRUE)

a. all traits in a axes:

b. all traits in separated axes:

3.参数说明

 Pmap: a dataframe, at least four columns. The first column is the name of SNP, the second column is the chromosome of SNP, the third column is the position of SNP, and the remaining columns are the P-value of each trait(Note:each trait a column).
 
 col: a vector or a matrix, if "col" equals to a vector, each circle use the same colors, it means that the same chromosome is drewed in the same color, the colors are not fixed, one, two, three or more colors can be used, if the length of the "col" is shorter than the length the chromosome, then colors will be applied circularly. 
   if "col" equals to a matrix, the row is the number of circles(traits), the columns are the colors that users want to use for different circles, so each circle can be plotted in different number of colors, the missing value can be replaced by NA. For example: 
   col=matrix(c("grey30","grey60",NA,"red","blue","green","orange",NA,NA),3,3,byrow=T).
 
 bin.size: the size of bin for SNP_density plot.
 
 bin.max: the max value of legend of SNP_density plot, the bin whose SNP number is bigger than 'bin.max' will be use the same color.
 
 pch: a number, the type for the points, is the same with "pch" in <plot>.
 
 band: a number, the space between chromosomes, the default is 1(if the band equals to 0, then there would be no space between chromosome).
 
 cir.band: a number, the space between circles, the default is 1.
 
 H: a number, the height for each circle, each circle represents a trait, the default is 1.
 
 ylim: a vector, the range of Y-axis when plotting the two type of Manhattans, is the same with "ylim" in <plot>.
 
 cex.axis: a number, controls the size of numbers of X-axis and the size of labels of circle plot.
 
 plot.type: a character or vector, only "d", "c", "m", "q" or "b" can be used. if plot.type="d", SNP density will be plotted; if plot.type="c", only circle-Manhattan plot will be plotted; if plot.type="m",only Manhattan plot will be plotted; if plot.type="q",only Q-Q plot will be plotted;if plot.type="b", both circle-Manhattan, Manhattan and Q-Q plots will be plotted; if plot.type=c("m","q"), Both Manhattan and Q-Q plots will be plotted.
 
 multracks: a logical,if multracks=FALSE, plotting multiple traits on multiple tracks, if it is TRUE, all Manhattan plots will be plotted in only one track.
 
 cex: a number or a vector, the size for the points, is the same with "size" in <plot>, and if it is a vector, the first number controls the size of points in circle plot(the default is 0.5), the second number controls the size of points in Manhattan plot(the default is 1), the third number controls the size of points in Q-Q plot(the default is 1)
 
 r: a number, the radius for the circle(the inside radius), the default is 1.
 
 xlab: a character, the labels for x axis.
 
 ylab: a character, the labels for y axis.
 
 xaxs: a character, The style of axis interval calculation to be used for the x-axis. Possible values are "r", "i", "e", "s", "d". The styles are generally controlled by the range of data or xlim, if given.
 
 yaxs: a character, The style of axis interval calculation to be used for the y-axis. See xaxs above..
 
 outward: logical, if outward=TRUE,then all points will be plotted from inside to outside.
 
 threshold: a number or vector, the significant threshold. For example, Bonfferoni adjustment method: threshold=0.01/nrow(Pmap). More than one significant line can be added on the plots, if threshold=0 or NULL, then the threshold line will not be added.
 
 threshold.col: a character or vector, the colour for the line of threshold levels.
 
 threshold.lwd: a number or vector, the width for the line of threshold levels.
 
 threshold.lty: a number or vector, the type for the line of threshold levels.
 
 amplify: logical, CMplot can amplify the significant points, if amplify=T, then the points greater than the minimal significant level will be highlighted, the default: amplify=TRUE.
 
 chr.labels: a vector, the labels for the chromosomes of circle-Manhattan plot.
 
 signal.cex: a number, if amplify=TRUE, users can set the size of significant points.
 
 signal.pch: a number, if amplify=TRUE, users can set the shape of significant points.
 
 signal.col: a character, if amplify=TRUE, users can set the colour of significant points, if signal.col=NULL, then the colors of significant points will not be changed.
 
 signal.line: a number, the width of the lines cross the circle
 
 cir.chr: logical, a boundary represents chromosome, the default is TRUE.
 
 cir.chr.h: a number, the width for the boundary, if cir.chr=FALSE, then this parameter will be useless.
 
 chr.den.col: a character or vector or NULL, the colour for the SNP density. If the length of parameter 'chr.den.col' is bigger than 1, SNP density that counts 
    the number of SNP within given size('bin.size') will be plotted around the circle. If chr.den.col=NULL, then the default colours are the same with the parameter "col" for circle.
 
 cir.legend: logical, whether to add the legend of each circle.
 
 cir.legend.cex: a number, the size of the number of legend.
 
 cir.legend.col: a character, the color of the axis of legend.
 
 LOG10: logical, whether to change the p-value into log10(p-value).
 
 box: logical, this function draws a box around the current Manhattan plot.
 
 conf.int.col: a character, the color of the confidence interval on QQ-plot.
 
 file.output: a logical, users can choose whether to output the plot results.
 
 file: a character, users can choose the different output formats of plot, so for, "jpg", "pdf", "tiff" can be selected by users.
 
 dpi: a number, the picture element for .jpg and .tiff files. The default is 300.
 
 memo: add a character to the output file name.
 
 verbose: whether print the reminder.

本文分享自微信公众号 - MedBioInfoCloud(MedBioInfoCloud)

原文出处及转载信息见文内详细说明,如有侵权,请联系 yunjia_community@tencent.com 删除。

原始发表时间:2019-03-27

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

我来说两句

0 条评论
登录 后参与评论

相关文章

  • 2019诺贝尔生理学奖颁布,有关缺氧机制的信号通路研究是否会火?

    北京时间2019年10月7日17时30分(斯德哥尔摩时间10月7日11时30分),2019年诺贝尔生理学或医学奖获奖名单公布。获奖者为: William G. ...

    DoubleHelix
  • TCGAbiolinks包帮助文档(英文版)

    DoubleHelix
  • R语言数据分析与挖掘(第八章):判别分析(2)——贝叶斯(Bayes)判别分析

    Bayes判别,它是基于Bayes准则的判别方法,判别指标为定量资料,它的判别规则和最大似然判别、Bayes公式判别相似,都是根据概率大小进行判别,要求各类近似...

    DoubleHelix
  • Controlling Access to the Kubernetes API

    ? API Server Ports and IPs By default the Kubernetes API server serves HTTP o...

    Walton
  • imp错误IMP-00098: INTERNAL ERROR: impgst2Segmentation fault

    如果使用impdp要看dump的内容,可以使用sqlfile参数,他会将所有的DDL语句写入文件,

    bisal
  • JDK7并行计算框架介绍一 Fork/Join概述(官方原版-英文)

    New in the Java SE 7 release, the fork/join framework is an implementation of th...

    数据饕餮
  • <Solidity学习系列四>使用编译器

    Solidity存储库的一个构建目标是solc,solidity命令行编译器。 使用solc --help为您提供所有选项的解释。 编译器可以生成各种输出,范围...

    用户3539187
  • Methods in Go

    Go supports some object-orient programming features. Method is one of these feat...

    李海彬
  • CodeForces 24B F1 Champions(排序)

    B. F1 Champions time limit per test 2 seconds memory limit per test 256 me...

    ShenduCC
  • POJ-1414 Life Line (暴力搜索)

    Life Line Time Limit: 1000MS Memory Limit: 10000K Total Submissions: 85...

    ShenduCC

扫码关注云+社区

领取腾讯云代金券