TwoSampleMR包,经典程度无需赘言,先跟着官方教程来小试牛刀吧——
它的自我介绍如下:
一个使用 GWAS 摘要数据执行孟德尔随机化的软件包。
它使用 IEU GWAS 数据库自动获取数据,并使用多种方法进行分析。
包如其名,TwoSampleMR主要是为两样本孟德尔随机化分析而准备的,在应用这个包以前,我们来看看它的核心函数及其功能:
install.packages("remotes")
remotes::install_github("MRCIEU/TwoSampleMR")
library(TwoSampleMR)
假如你的暴露数据是从现有的GWAS文章中download来的,那就用这招:
random_df <- data.frame(
SNP = c("rs1", "rs2"),
beta = c(1, 2),
se = c(1, 2),
effect_allele = c("A", "T")
)
random_df
random_exp_dat <- format_data(random_df, type = "exposure") ##规范为MR分析的标准格式
random_exp_dat
假如你想用现成的GWAS数据库并从找到感兴趣的数据:
remotes::install_github("MRCIEU/MRInstruments")
library(MRInstruments) ##该软件包包含许多 data.frames,其中每个都是 SNP 与性状关联的存储库。
data(gwas_catalog)
head(gwas_catalog)
例如,想用 Speliotes 等人 2010 年的研究获得BMI的工具变量,
bmi_gwas <-
subset(gwas_catalog,
grepl("Speliotes", Author) &
Phenotype == "Body mass index")
bmi_exp_dat <- format_data(bmi_gwas)
如果想要外周血中的代谢物相关QTL,
data(metab_qtls)
head(metab_qtls)
ala_exp_dat <- format_metab_qtls(subset(metab_qtls, phenotype == "Ala")) ##某个代谢物的QTL
蛋白相关QTL也同理,
data(proteomic_qtls)
head(proteomic_qtls)
apoh_exp_dat <- format_proteomic_qtls(subset(proteomic_qtls, analyte == "ApoH")) ##获得载脂蛋白 H 蛋白的QTL
某个基因的,
data(gtex_eqtl)
head(gtex_eqtl)
irak1bp1_exp_dat <- format_gtex_eqtl(subset(
gtex_eqtl,
gene_name == "IRAK1BP1" & tissue == "Adipose Subcutaneous"
)) ##某个组织中的某个基因相关的QTL
某个性状的某个甲基化位点相关QTL,
data(aries_mqtl)
head(aries_mqtl)
cg25212131_exp_dat <- format_aries_mqtl(subset(aries_mqtl, cpg == "cg25212131" &
age == "Birth"))
更常见的情况是,我们通过IEU gwas数据库找到了感兴趣的ID,比如ieu-a-2
:
bmi2014_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
该函数将返回一组对BMI具有 GWAS 意义的LD clumped SNP
。您可以为该函数指定各种参数:
p1
= P-value threshold for keeping a SNPclump
= Whether or not to return independent SNPs only (default is TRUE
)r2
= The maximum LD R-square allowed between returned SNPskb
= The distance in which to search for LD R-square values这些阈值是在筛选暴露相关的SNPs,就是所谓的工具变量时需要我们去设定的,没有固定的标准。
接下来是Clumping
步骤,其目的是:
bmi_exp_dat <- clump_data(bmi_exp_dat)
需要注意的是,只有种群内 MAF > 0.01 的 SNP 是可用的。
ao <- available_outcomes()
bmi_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
chd_out_dat <- extract_outcome_data(snps = bmi_exp_dat$SNP, outcomes = 'ieu-a-7') ##这个ID也是根据感兴趣的结局去IEU数据库里找到的,运行完这里的第一行代码能得到很多ID,按需挑选
如果你的暴露仍然是BMI,结局数据是来自本地的,存在这个文件夹里➡gwas_summary.csv
,其行名:【rsid,effect,SE,a1,a2,a1_freq,p-value,Units,Gene,n】,那么:
outcome_dat <- read_outcome_data(
snps = bmi_exp_dat$SNP,
filename = "gwas_summary.csv",
sep = ",",
snp_col = "rsid",
beta_col = "effect",
se_col = "SE",
effect_allele_col = "a1",
other_allele_col = "a2",
eaf_col = "a1_freq",
pval_col = "p-value",
units_col = "Units",
gene_col = "Gene",
samplesize_col = "n"
)
dat <- harmonise_data(
exposure_dat = bmi_exp_dat,
outcome_dat = chd_out_dat
)
一旦暴露和结果数据得到统一,我们就可以得到暴露和结果性状中每个工具 SNP 的效应和标准误差。我们可以利用这些信息进行孟德尔随机化。为此,只需运行:
res <- mr(dat)
res
mr_method_list()##查看mr支持的MR分析方法
mr(dat, method_list = c("mr_egger_regression", "mr_ivw"))#可以选择自己想要的方法
mr_heterogeneity(dat)
mr_pleiotropy_test(dat)
返回一个与 mr()
输出结果类似的 data.frame,但它会对每个暴露-结果组合进行多次分析,每次使用不同的单 SNP 进行分析。
res_single <- mr_singlesnp(dat)
res_loo <- mr_leaveoneout(dat)
p3 <- mr_leaveoneout_plot(res_loo)
res <- mr(dat)
p1 <- mr_scatter_plot(res, dat)
res_single <- mr_singlesnp(dat)
p2 <- mr_forest_plot(res_single)
res_single <- mr_singlesnp(dat)
p4 <- mr_funnel_plot(res_single)
1 对多的 MR 分析探究单个暴露对多个结果或多个暴露对单个结果的影响。这种分析的结果可以使用 1 对多森林图进行可视化,无论是否对分类变量进行分层。从可视化的角度来看,该功能最适合 50 个或更少的结果,并不适合处理 100 个以上的结果。如果结果数远大于 50 个,最好将其分成两个独立的图。
例如,如果您有 100 组结果,您可以将这些结果平均分配到两个图中,然后在 Powerpoint 等其他程序中将这两个图合并在一起。该功能假定结果已经按照正确的顺序绘制。因此,建议用户根据自己希望在绘图中显示的方式对结果进行排序。用户可以使用自己的代码来完成这项工作,也可以使用 sort_1_too_many()
函数。
exp_dat <- extract_instruments(outcomes = c(2, 100, 1032, 104, 1, 72, 999))
table(exp_dat$exposure)
chd_out_dat <- extract_outcome_data(
snps = exp_dat$SNP,
outcomes = 7
)
dat2 <- harmonise_data(
exposure_dat = exp_dat,
outcome_dat = chd_out_dat
)
res <- mr(dat2)
res <- subset_on_method(res) # default is to subset on either the IVW method (>1 instrumental SNP) or Wald ratio method (1 instrumental SNP).
res <- sort_1_to_many(res, b = "b", sort_action = 4) # this sorts results by decreasing effect size (largest effect at top of the plot)
res <- split_exposure(res) # to keep the Y axis label clean we exclude the exposure ID labels from the exposure column
res$weight <- 1/res$se
min(exp(res$b - 1.96*res$se)) # identify value for 'lo' in forest_plot_1_to_many
max(exp(res$b + 1.96*res$se)) # identify value for 'up' in forest_plot_1_to_many
forest_plot_1_to_many(
res,
b = "b",
se = "se",
exponentiate = TRUE,
ao_slc = FALSE,
lo = 0.3,
up = 2.5,
TraitM = "exposure",
col1_width = 2,
by = NULL,
trans = "log2",
xlab = "OR for CHD per SD increase in risk factor (95% confidence interval)",
weight = "weight"
)
res$pval<-formatC(res$pval, format = "e", digits = 2)
forest_plot_1_to_many(
res,
b = "b",
se = "se",
exponentiate = TRUE,
ao_slc = FALSE,
lo = 0.3,
up = 2.5,
TraitM = "exposure",
by = NULL,
trans = "log2",
xlab = "OR for CHD per SD increase in risk factor (95% CI)",
weight = "weight",
subheading_size = 11,
col1_title = "Risk factor",
col1_width = 2.5,
col_text_size = 4,
addcols = c("nsnp", "pval"),
addcol_widths = c(1.0, 1.0),
addcol_titles = c("No. SNPs", "P-val")
)
forest_plot_1_to_many(
res,
b = "b",
se = "se",
exponentiate = TRUE,
ao_slc = FALSE,
lo = 0.3,
up = 3.0,
TraitM = "exposure",
col1_width = 2.0,
by = NULL,
trans = "log2",
xlab = "",
addcols = c("nsnp", "pval"),
weight = "weight",
col_text_size = 4,
addcol_widths = c(0.5, 1.0),
addcol_titles = c("", "")
)
exp_dat <- extract_instruments(outcomes = 2) # extract instruments for BMI
ao <- available_outcomes()
ao <- ao[ao$category == "Disease", ] # identify diseases
ao <- ao[which(ao$ncase > 100), ]
dis_dat <- extract_outcome_data(
snps = exp_dat$SNP,
outcomes = ao$id
)
dat3 <- harmonise_data(
exposure_dat = exp_dat,
outcome_dat = dis_dat
)
res <- mr(dat3, method_list = c("mr_wald_ratio", "mr_ivw"))
res <- split_outcome(res) # to keep the Y axis label clean we exclude the exposure ID labels from the exposure column
res <- sort_1_to_many(res, b = "b", sort_action = 4) # this sorts results by decreasing effect size (largest effect at top of the plot)
res1 <- res[1:52, ]
res2 <- res[53:103, ]
plot1 <- forest_plot_1_to_many(
res1,
b = "b",
se = "se",
exponentiate = TRUE,
trans = "log2",
ao_slc = FALSE,
lo = 0.004,
up = 461,
col1_width = 2,
TraitM = "outcome",
col_text_size = 3,
xlab = ""
)
plot2 <- forest_plot_1_to_many(
res2,
b = "b",
se = "se",
exponentiate = TRUE,
trans = "log2",
ao_slc = FALSE,
lo = 0.004,
up = 461,
col1_width = 2,
TraitM = "outcome",
subheading_size = 11,
col_text_size = 3,
xlab = ""
)
plot1
plot2
pdf("plot1.pdf", height = 10, width = 8)
plot1
dev.off()
MR.RAPS(Robust Adjusted Profile Score)是最近提出的一种方法,它考虑了 SNP 暴露效应的测量误差,在存在许多(如数百个)弱工具变量时无偏倚,并对系统性和特异性多效性具有鲁棒性。
res <- mr(dat, method_list = c("mr_raps"))
在 MR 中,假定工具首先影响暴露,然后通过暴露影响结果。但有时这很难评估,例如顺式作用 SNP 是先影响基因表达水平还是先影响 DNA 甲基化水平?这时就可以使用 Steiger 检验来检验假设暴露与结果之间的因果方向。
out <- directionality_test(dat)
这个神奇的函数神奇到可直接生成一份报告,执行所有MR分析、敏感性分析和绘图函数后,将其结果显示在一个独立的 html 网页、word 文档或 pdf 文档中。
mr_report(dat)