asreml是非常强大的软件, 由于太强大, 很多人不会使用. 基因组选择在育种中的应用, 其基础是常规的系谱动物模型, 动物模型也可以很复杂, 看一下asreml的说明书就知道了, 有300多页, 据我了解, 其厚度可以用这个公式表示:
这说明一个问题, Arthur Gilmour教授(asreml的作者)是一个非常有耐心, 也非常厉害的统计学家, 他花费了自己的大半生, 将自己的心血编程了这个软件, 我很佩服.
这个教程是asreml在基因组选择和分子育种中的应用, 下面是我的读书笔记.
一个朋友说, 我们这个圈子很小了, 如果大家再不知道怎么分享, 怎么交流, 那我们这个学科以后怎么办呢, 这也是我停不下来的原因. 尼采说过: 力的过剩, 是力的证明. 他把不务正业说的这么理所应当, 搞得我将斜杠青年进行到底的决心变得更加稳固. 废话少说, 以下是目录.
这篇文档的主要目标是介绍ASReml在基因组分析中的实现方法, 它假定读者有一定的统计基础. 在本文档中, 不对统计和模型做过多的介绍.
示例数据:
ID,effect,SNP_1,SNP_100,SNP_1000,SNP_101,SNP_102,SNP_103,SNP_104,SNP_105,SNP_106,SNP_107,SNP_108,SNP_109,SNP_11,SNP_110,SNP_111,SNP_112,SNP_113,SNP_114
ID_1,-0.259731957336183,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_10,0.117554666740654,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
ID_100,0.00357380737732867,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_101,0.344906212015101,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0
ID_102,0.376403712779367,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
ID_103,0.131676984710817,0,0,0,0,1,1,0,1,1,1,0,0,0,0,0,0,0,0
ID_104,0.41299708896122,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_105,0.353890056009646,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_106,0.237438809186312,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_107,-0.316455302927825,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_108,-0.235784805404543,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_109,0.0783501427411017,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_11,0.0919863476998604,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
ID, 观测值为effect, 第三列及以后为SNP 名称.
将每个标记作为固定因子, 循环运行:
!cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
effect ~ mu $I
可以在asr文件中, 查看每个SNP的显著性, 这是单标记方差分析.
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.83 0.363
14 SNP_109 2 651.0 5.20 0.006
Finished: 19 Oct 2018 17:04:23.666 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 13 value is SNP_11
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
Notice: 1 singularities detected in design matrix.
1 LogL= 603.924 S2= 0.56887E-01 652 df
2 LogL= 603.924 S2= 0.56887E-01 652 df
- - - Results from analysis of effect - - -
LogL: 603.92 0.568871E-01 652 2 SNP_11 "LogL Converged"
Akaike Information Criterion -1205.85 (assuming 1 parameters).
Bayesian Information Criterion -1201.37
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.568871E-01 18.06 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 652.0 0.82 0.366
15 SNP_11 1 652.0 1.25 0.264
Finished: 19 Oct 2018 17:04:24.058 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 14 value is SNP_110
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 601.263 S2= 0.56936E-01 651 df
2 LogL= 601.263 S2= 0.56936E-01 651 df
- - - Results from analysis of effect - - -
LogL: 601.26 0.569356E-01 651 2 SNP_110 "LogL Converged"
Akaike Information Criterion -1200.53 (assuming 1 parameters).
Bayesian Information Criterion -1196.05
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.569356E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.82 0.366
16 SNP_110 2 651.0 0.85 0.429
Finished: 19 Oct 2018 17:04:24.499 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 15 value is SNP_111
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 600.791 S2= 0.57054E-01 651 df
2 LogL= 600.791 S2= 0.57054E-01 651 df
- - - Results from analysis of effect - - -
LogL: 600.79 0.570539E-01 651 2 SNP_111 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 SNP_109 LogL: 605.70 Deviance: 12.35
Akaike Information Criterion -1199.58 (assuming 1 parameters).
Bayesian Information Criterion -1195.10
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.570539E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.81 0.367
17 SNP_111 2 651.0 0.17 0.843
Finished: 19 Oct 2018 17:04:24.962 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 16 value is SNP_112
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
Notice: 1 singularities detected in design matrix.
1 LogL= 602.714 S2= 0.56989E-01 652 df
2 LogL= 602.714 S2= 0.56989E-01 652 df
- - - Results from analysis of effect - - -
LogL: 602.71 0.569893E-01 652 2 SNP_112 "LogL Converged"
Akaike Information Criterion -1203.43 (assuming 1 parameters).
Bayesian Information Criterion -1198.95
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.569893E-01 18.06 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 652.0 0.82 0.367
18 SNP_112 1 652.0 0.08 0.776
Finished: 19 Oct 2018 17:04:25.435 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 17 value is SNP_113
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 601.723 S2= 0.57001E-01 651 df
2 LogL= 601.723 S2= 0.57001E-01 651 df
- - - Results from analysis of effect - - -
LogL: 601.72 0.570011E-01 651 2 SNP_113 "LogL Converged"
Akaike Information Criterion -1201.45 (assuming 1 parameters).
Bayesian Information Criterion -1196.97
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.570011E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.82 0.367
19 SNP_113 2 651.0 0.47 0.623
Finished: 19 Oct 2018 17:04:25.904 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 18 value is SNP_114
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 606.497 S2= 0.56038E-01 651 df
2 LogL= 606.497 S2= 0.56038E-01 651 df
- - - Results from analysis of effect - - -
LogL: 606.50 0.560380E-01 651 2 SNP_114 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 SNP_114 LogL: 606.50 Deviance: 13.94
Akaike Information Criterion -1210.99 (assuming 1 parameters).
Bayesian Information Criterion -1206.51
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.560380E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.83 0.363
20 SNP_114 2 651.0 6.08 0.002
Best LogL 606.50 0.560380E-01 651 2 SNP_114 LogL Converged
Finished: 19 Oct 2018 17:04:26.403 LogL Converged
结果可以看出, 第20(SNP_114)个SNP达到极显著, 第16(SNP_109)个SNP达到显著水平.
我们也可以将其作为随机因子, 查看Log-likehood评价模型. 如果比空模型好(LRT检验), 那说明标记效应明显.
!cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
effect ~ mu !r $I
结果:
LogL: LogL Residual NEDF NIT Cycle Text
LogL: 607.75 0.564653E-01 653 6 SNP_1 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_100 "LogL Converged"
LogL: 606.11 0.569091E-01 653 7 SNP_1000 "LogL Converged"
LogL: 606.37 0.567870E-01 653 4 SNP_101 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_102 "LogL Converged"
LogL: 606.21 0.568392E-01 653 5 SNP_103 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_104 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_105 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_106 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_107 "LogL Converged"
LogL: 606.57 0.567311E-01 653 4 SNP_108 "LogL Converged"
LogL: 609.22 0.561598E-01 653 3 SNP_109 "LogL Converged"
LogL: 606.12 0.568872E-01 653 5 SNP_11 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 SNP_109 LogL: 609.22 Deviance: 6.22
LogL: 606.16 0.568635E-01 653 4 SNP_110 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_111 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_112 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_113 "LogL Converged"
LogL: 608.14 0.560577E-01 653 8 SNP_114 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 SNP_114 LogL: 608.14 Deviance: 4.06
同样的结果, 我们可以看到Local CYCLE中 达到Peak的点在SNP_109 6.22 和SNP_114 4.06, 说明这两个SNP位点达到显著性水平.
另一种写法, 应对标记比较多的情况, 不用每个标记都需要用!cycle指定名称, 可以用!G N, N是标记个数进行代替. 这种方法的缺点是没有SNP标记名称.
ID !A # ID_101
effect # 0.344906212015101
Marks !G 18
# !cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
!cycle 1:18
effect ~ mu !r Marks[$I]
结果:
LogL: LogL Residual NEDF NIT Cycle Text
LogL: 607.75 0.564653E-01 653 6 1 "LogL Converged"
LogL: 606.10 0.569091E-01 653 6 2 "LogL Converged"
LogL: 606.11 0.569091E-01 653 7 3 "LogL Converged"
LogL: 606.37 0.567870E-01 653 4 4 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 5 "LogL Converged"
LogL: 606.39 0.567814E-01 653 4 6 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 7 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 8 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 9 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 10 "LogL Converged"
LogL: 606.53 0.567416E-01 653 4 11 "LogL Converged"
LogL: 607.88 0.564391E-01 653 5 12 "LogL Converged"
LogL: 606.12 0.568872E-01 653 5 13 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 12 LogL: 607.88 Deviance: 3.55
LogL: 606.11 0.569077E-01 653 5 14 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 15 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 16 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 17 "LogL Converged"
LogL: 607.67 0.564839E-01 653 5 18 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 18 LogL: 607.67 Deviance: 3.12
查看sln中的BLUP值, 放到excel中排序, 可以看出两个标记比较大:
如果有每个标记的map位置, 我们就可以进行作图.
顾名思义, 就是讲所有Marks放在一起进行分析.
ID !A # ID_101
effect # 0.344906212015101
Marks !G 18
# !cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
# !cycle 1:18
# effect ~ mu !r Marks[$I]
# effect ~ mu # LogL= 606.105
effect ~ mu !r Marks
结果:
8 LogL= 607.362 S2= 0.55772E-01 653 df 0.1377E-01
Final parameter values 0.1378E-01
- - - Results from analysis of effect - - -
Akaike Information Criterion -1210.72 (assuming 2 parameters).
Bayesian Information Criterion -1201.76
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
Marks 17.30 0.965402E-01 53.0 1.0
Residual Variance 635.70 0.557723E-01 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
Marks IDV_V 18 0.137842E-01 0.768776E-03 1.24 0 P
Residual SCA_V 654 1.00000 0.557723E-01 17.83 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
4 mu 1 73.6 1.52 0.222
Notice: The DenDF values are calculated ignoring fixed/boundary/singular
variance parameters using algebraic derivatives.
Solution Standard Error T-value T-prev
4 mu
1 -0.167809E-01 0.136271E-01 -1.23
3 Marks 18 effects fitted
空模型的log值是606, Mark模型是607, 轻微提高.
查看sln的BLUP值
理论介绍
GBLUP所依据的公式为:
M是n*m构成的矩阵, n是个体数, m为标记数(marker), g是每个标记的BLUP值. 随着标记数目的增加, m >>n的情况出现导致算法需要调整. 现在通用的是
如果已经计算出G矩阵, 可以使用asreml进行GBLUP的估算, 代码如下:
!work 12 !ARG 1
QTL ANALUSIS
id !P
SEX !A
AGE !A
HEIGHT !M -9999
idbgrm.ped !mark !alpha
ibdgrm.grm !ND !dense
ibdgrm.dat
HEIGHT ~ mu SEX !R nrm(id) grm1(id)
在下一章节中, 我们将对GS的延伸方法: Fast Bayes A进行介绍.
EM BayesA-like方法, 参考 Sun et al. (2012)开发而成.
一般标记矩阵的编码方法为: 0 1 2,
构建矩阵的方法, 公式为:
具体参数:
Bayes A, 假定性状是由主效QTL控制, 少数QTL解释了一大半的变异, 而不是像GBLUP所假定每个标记的有相同的方差(符合正态分布)
Fast Bayes A:
Bayes B的方法在asreml中实现:
marker文件格式:
标记文件的命令参数, 这些参数都需要和标记文件放在同一行才可以起作用 filename.mkr
权重G矩阵
常规GBLUP命令
!wrokspace 1
title: standard GBLUP model
ID *
phenotype
genotype.mrk !markers 10031 !IDS 3226 # 标记文件有10031个SNP, ID有3225个
phenotype.txt !skip 1 !maxit 50 !gdense #使用稠密矩阵(dense)
phenotype ~ mu !r grm1(ID)
residual units
结果说明
Fast Bayes A方法命令 很多时候, 我们对一些效应较大的标记感兴趣, 例如QTL, 但是GBLUP估计是收缩是估计(shrunken estimators), QTL的效应值会被周围的标记吸收掉, 导致大效应标记难以发现. Bayes A的模型可以鉴定少数大效应的标记, 这里的Fast Bayes-A like 方法类似. 对于一些性状, Fast Bayes-A比GBLUP的预测效果更好.
调整对角线D
常规Fast-BayesA命令
!wrokspace 1
title: Fast-BayesA model
ID !A
phenotype
genotype.mrk !markers 10031 !IDS 3226 !FBA 4.2 # 标记文件有10031个SNP, ID有3225个, !FBA 设置为4.2
phenotype.txt !skip 1 !maxit 50
phenotype ~ mu !r grm1(ID) 0.808 !GF # 这里Vg的gamma设置为0.808, 固定方差组分
residual units
结果说明
不同的K值, Vg是固定还是估计 比较
结论:
基因型的GBLUP在.sln中, mark的效应在.mef中, 标记的权重(weight)在.mef中, 大效应的标记在.res文件中.
如果已经鉴定出大效应的SNP, 可以放在模型中, 这样模型就可以利用GWAS和QTL的信息, 提高预测的准确性.
snp(ID, 954) snp(ID,4480)
可以作为固定因子, 或者随机因子.
GS中, 多性状GS模型的效果要高于单性状GS, asreml中有很多强大的函数可以利用, 未来可期.