前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >eQTL分析中对转录组表达量的值进行分位数标准化和反正则转换

eQTL分析中对转录组表达量的值进行分位数标准化和反正则转换

作者头像
用户7010445
发布2023-12-26 14:40:16
1660
发布2023-12-26 14:40:16
举报

论文

Structural variants and short tandem repeats impact gene expression and splicing in bovine testis tissue

https://academic.oup.com/genetics/article/225/3/iyad161/7258327?login=false

代码

https://github.com/Meenu-Bhati/SV-STR/blob/main/RNA_quantification/TPM_normalization.r

https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py

https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py

为啥要做这个分位数标准化和反正则转换暂时不太理解,这个链接给出一些解释 https://bsunfullsite.github.io/post/quantile-nomrlization-and-inverse-normal-transform/quantile-normalization-and-inverse-normal-transform/

也没看太明白,基本上做eQTL分析的论文里方法部分都会写这个,比如开头提到的论文,方法不部分写到

Finally, TPM values were quantile normalized and inverse normal transformed across samples per gene using the R package RNOmni

水稻泛基因组的论文 (A super pan-genomic landscape of rice),做eQTL分析,方法部分写到

To obtain a normal distribution of expression values for each gene, FPKM values of each gene were further normalized using the quantile-quantile normalization (qqnorm) function in R (version 3.1.2).

玉米eQTL的论文 (A role for heritable transcriptomic variation in maize adaptation to temperate environments)方法部分写到

For each gene, expression values were transformed using the Box-Cox method [103] prior to mapping

西红柿泛基因组 eQTL分析部分方法

Genes with TPM values of >0.5 were defined as expressed. Only genes expressed in at least 100 accessions were retained for the downstream analyses. Raw expression levels were normalized with quantile–quantile normalization

这篇文章里提供的代码链接 https://github.com/YaoZhou89/TGG/blob/main/5.Genetic_analysis/scripts/prepare_gene_expression.R

这里标准化是自己自定义的函数

代码语言:javascript
复制
quantile_normalisation <- function(df){
  df_rank <- apply(df,2,rank,ties.method="min")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)
  
  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }
  
  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}

我试了一下这个函数的输出和preprocessCore::normalize.quantiles 这个函数的输出是一致的

https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py 这个链接里提供了python做这个标准化的函数

代码语言:javascript
复制
def normalize_quantiles(df):
    """
    Quantile normalization to the average empirical distribution
    Note: replicates behavior of R function normalize.quantiles
          from library("preprocessCore")

    Reference:
     [1] Bolstad et al., Bioinformatics 19(2), pp. 185-193, 2003

    Adapted from https://github.com/andrewdyates/quantile_normalize
    """
    M = df.values.copy()

    Q = M.argsort(axis=0)
    m,n = M.shape

    # compute quantile vector
    quantiles = np.zeros(m)
    for i in range(n):
        quantiles += M[Q[:,i],i]
    quantiles = quantiles / n

    for i in range(n):
        # Get equivalence classes; unique values == 0
        dupes = np.zeros(m, dtype=np.int)
        for j in range(m-1):
            if M[Q[j,i],i] == M[Q[j+1,i],i]:
                dupes[j+1] = dupes[j]+1

        # Replace column with quantile ranks
        M[Q[:,i],i] = quantiles

        # Average together equivalence classes
        j = m-1
        while j >= 0:
            if dupes[j] == 0:
                j -= 1
            else:
                idxs = Q[j-dupes[j]:j+1,i]
                M[idxs,i] = np.median(M[idxs,i])
                j -= 1 + dupes[j]
        assert j == -1

    return pd.DataFrame(M, index=df.index, columns=df.columns)

开头提到的论文里除了分位数标准化还做了反正则转换,这个有现成的R包 RNOmni,代码

代码语言:javascript
复制
expr.int = t(apply(file_filter_norm, 1, RankNorm ))

eQTL分析还有一步是用peer这个包计算混杂因素(To remove potential batch effects and cconfounding factors),之前有一个困惑是直接用TPM值去计算混杂因素还是用标准化后的表达数据去计算这个混杂因素

https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl

这个链接里有一些步骤,这里用的是标准化后的数据。计算多少个混杂因素这个链接里也给出了建议

The number of PEER factors was selected as function of sample size (N):

  • 15 factors for N < 150
  • 30 factors for 150 ≤ N < 250
  • 45 factors for 250 ≤ N < 350
  • 60 factors for N ≥ 350
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2023-12-21,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 小明的数据分析笔记本 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 论文
  • 代码
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档