eQTL 分析是常用的多组学整合分析方法,使我们可以将基因表达水平的变化与基因型联系起来,有助于揭示生命系统的生理生化过程,发现导致某些疾病的遗传因素,以及确定受他们影响的生物学通路。
今天给大家介绍的是用 Matrix eQTL 进行 eQTL 分析,这个 R 包 2012 年发表在 Bioinformatics 上,在测试数据中可以看到 Matrix eQTL 的计算速度非常快。
软件官网:http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL
定义:表达数量性状基因座(expression Quantitative Trait Loci,eQTL)指与单个基因 mRNA 表达量相关的 DNA 突变。eQTL 可分为 cis-eQTL 和 trans-eQTL,前者就是某个基因的 eQTL 定位到该基因所在的基因组区域,表明可能是该基因本身的差别引起的 mRNA 水平变化;后者是指某个基因的 eQTL 定位到其他基因组区域,表明其他基因的差别控制该基因 mRNA 水平的差异。
install.packages("MatrixEQTL")
学习一个 R 包,最直接的方式就是运行示例代码和示例数据,Matrix eQTL 也不例外。第一步,加载程序包:
library("MatrixEQTL")
示例数据地址:
## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');
•基因型数据 SNP.txt
,列为样本,行为 SNP,Matrix eQTL 需要以 012 的形式来表示基因型
snpid Sam_01 Sam_02 Sam_03 Sam_04 Sam_05 Sam_06 Sam_07 Sam_08 Sam_09 Sam_10 Sam_11 Sam_12 Sam_13 Sam_14 Sam_15 Sam_16
Snp_01 2 0 2 0 2 1 2 1 1 12 2 1 2 2 1
Snp_02 0 1 1 2 2 1 0 0 0 11 1 1 0 1 1
Snp_03 1 0 1 0 1 1 1 1 0 11 0 1 1 1 2
Snp_04 0 1 2 2 2 1 1 0 0 01 2 1 1 1 0
Snp_05 1 1 2 1 1 2 1 1 0 11 2 0 1 2 1
Snp_06 2 2 2 1 1 0 1 0 2 11 1 2 0 2 1
Snp_07 1 1 2 2 0 1 1 1 1 02 2 0 1 1 1
Snp_08 1 0 1 0 1 0 0 1 1 10 2 0 1 1 1
...
•基因表达矩阵 GE.txt
geneid Sam_01 Sam_02 Sam_03 Sam_04 Sam_05 Sam_06 Sam_07 Sam_08 Sam_09 Sam_10 Sam_11 Sam_12 Sam_13 Sam_14 Sam_15 Sam_16
Gene_01 4.91 4.63 5.18 5.07 5.74 5.09 5.31 5.29 4.73 5.72 4.75 4.54 5.01 5.03 4.84 4.44
Gene_02 13.78 13.14 13.18 13.04 12.85 13.07 13.09 12.83 14.94 13.86 15.26 14.73 14.08 14.33 14.72 14.97
Gene_03 12.06 12.29 13.07 13.65 13.86 12.84 12.29 13.03 13.13 14.93 12.40 13.38 13.70 14.49 14.14 13.35
Gene_04 11.63 11.88 12.74 12.66 13.16 11.99 11.97 12.81 11.74 13.29 10.88 11.37 12.09 12.50 11.40 11.45
Gene_05 14.72 14.66 14.63 15.91 15.46 14.74 15.17 15.01 14.05 14.90 12.96 13.56 14.06 14.44 14.12 13.76
Gene_06 12.29 12.23 12.47 13.21 12.63 12.18 12.44 12.45 13.22 12.70 12.93 12.84 13.20 13.19 12.64 12.80
Gene_07 12.56 12.71 12.49 13.41 13.60 12.35 12.27 13.42 14.73 15.47 13.85 14.31 15.25 15.55 14.90 14.04
Gene_08 12.27 12.58 12.61 13.02 12.86 12.32 12.30 13.19 15.21 14.60 14.72 14.56 14.98 14.92 14.53 14.72
Gene_09 9.82 9.29 8.95 8.18 8.11 9.43 9.17 9.25 10.95 9.82 12.60 11.99 11.20 10.31 11.13 11.69
...
•协变量信息 Covariates.txt
id Sam_01 Sam_02 Sam_03 Sam_04 Sam_05 Sam_06 Sam_07 Sam_08 Sam_09 Sam_10 Sam_11 Sam_12 Sam_13 Sam_14 Sam_15 Sam_16
gender 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
age 36 40 46 65 69 43 40 54 44 70 24 34 55 62 48 40
...
•基因位置信息 geneloc.txt
基因位置信息,包括四列:基因名,染色体,以及起始和终止位置
geneid chr left right
Gene_01 chr1 721289 731289
Gene_02 chr1 752565 762565
Gene_03 chr1 777121 787121
Gene_04 chr1 785988 795988
Gene_05 chr1 792479 802479
Gene_06 chr1 798958 808958
Gene_07 chr1 888658 898658
Gene_08 chr1 918572 928572
Gene_09 chr1 926430 936430
...
•SNP 位置信息 snpsloc.txt
SNP 位置信息,包括三列:SNP 名称,染色体,以及位置
snpid chr pos
Snp_01 chr1 721289
Snp_02 chr1 752565
Snp_03 chr1 777121
Snp_04 chr1 785988
Snp_05 chr1 792479
Snp_06 chr1 798958
Snp_07 chr1 888658
Snp_08 chr1 918572
Snp_09 chr1 926430
...
接着设置参数,这里选择线性模型。
# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
另外还需要协变量文件,如果没有协变量,则可将变量 covariates_file_name
设为 character()
。
# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
# Output file name
output_file_name = tempfile();
接下来设置 p 值的阈值,确定输出文件 output_file_name
中的结果。注意,对于较大的数据集,阈值应设置的较小,否则可能会导致输出文件过大。
# Only associations significant at this level will be saved
pvOutputThreshold = 1e-2;
最后,定义误差协方差矩阵,这个参数很少用到。
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
下一部分包含三块非常相似的代码,分别导入基因型、基因表达和协变量文件。在每个部分中,可以设置文件分隔符(如,制表符 "\t"
、逗号“ ","
,或空格 " "
)、缺失值、带有列名的行以及带有行名的列。
## Load genotype data
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
## Load gene expression data
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
## Load covariates
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name);
}
最后,运行 Matrix eQTL 的主函数:
## Run the analysis
me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
unlink(output_file_name);
每个小于 P 值的记录都会保存于 me
变量中。每条记录包含 SNP 名称、基因名称、beta 值、P 值和 FDR 。
show(me$all$eqtls)
snps gene statistic pvalue FDR beta
1 Snp_05 Gene_03 38.812160 5.515519e-14 8.273279e-12 0.4101317
2 Snp_13 Gene_09 -3.914403 2.055817e-03 1.541863e-01 -0.2978847
3 Snp_11 Gene_06 -3.221962 7.327756e-03 2.853368e-01 -0.2332470
4 Snp_04 Gene_10 3.201666 7.608981e-03 2.853368e-01 0.2321123
5 Snp_14 Gene_01 3.070005 9.716705e-03 2.915011e-01 0.2147077
Matrix eQTL 可以根据距离分别计算 cis-eQTL 和 trans- eQTL。和上面代码的区别在于需要多设置三个参数和两个输入文件:
•pvOutputThreshold.cis
设置 cis-eQTLs 的 P value 阈值•output_file_name.cis
设置 cis-eQTLs 输出文件名•cisDist
定义 SNP 和基因的最远距离•snpspos
SNP 位置信息,包括三列:SNP 名称,染色体,以及位置•genepos
基因位置信息,包括四列:基因名,染色体,以及起始和终止位置
library(MatrixEQTL)
## Location of the package with the data files.
base.dir = find.package('MatrixEQTL');
# base.dir = '.';
## Settings
# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
# Gene expression file name
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
# Covariates file name
# Set to character() for no covariates
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
# Output file name
output_file_name_cis = tempfile();
output_file_name_tra = tempfile();
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 2e-2;
pvOutputThreshold_tra = 1e-2;
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric();
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6;
## Load genotype data
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
## Load gene expression data
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
## Load covariates
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name);
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis = pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
unlink(output_file_name_tra);
unlink(output_file_name_cis);
如果想用自己的数据进行分析,直接将上面代码中的五个输入文件替换即可。
•genotype•expression•covariates•gene location•SNP location