A detailed overview of RNA-seq data analysis tools (from quality control to visualization)

  • Methods and tools for RNA-seq-based co-expression network analysis
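  • Very comprehensive: it covers every step from quality control to visualization, with a description, strengths and limitations for each tool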

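The original article can be downloaded here. The Excel table is on my Baidu Pan; click here to download. Password: cz06

If you are reading on a phone, I suggest switching to this article instead.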

Tool/method

Description, strengths (+) and limitations (−)

Quality control

FastQC

• A tool that uses .fastq, .bam or .sam files to identify and highlight potential issues in the data, such as low base quality scores, low sequence quality and GC content biases. + Can be used either with or without user interface. − Uses only the first 200 000 sequences in the file.
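
To make "low base quality scores" concrete, here is a minimal sketch that decodes FASTQ quality strings and flags reads with a low mean quality. It is not how FastQC itself works and it covers only one of the checks FastQC reports. The Phred+33 encoding and the cutoff of 20 are assumptions for illustration, and the function names are hypothetical.

```python
def mean_phred(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def low_quality_reads(fastq_text: str, min_mean_q: float = 20.0) -> list[str]:
    """Return the IDs of reads whose mean base quality is below min_mean_q.

    Assumes the simple 4-line-per-record FASTQ layout:
    @id, sequence, +, quality.
    """
    lines = fastq_text.strip().splitlines()
    flagged = []
    for i in range(0, len(lines) - 3, 4):
        read_id = lines[i][1:].split()[0]      # drop the leading '@'
        if mean_phred(lines[i + 3].strip()) < min_mean_q:
            flagged.append(read_id)
    return flagged
```

A real pipeline would stream the file rather than hold it in memory, and would look at per-position quality as well as per-read means.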

RSeQC

+ A tool with a wider range of quality control measures than FastQC. + Can also be used on mapped data to obtain information on metrics such as the prevalence of splicing events.

QoRTs

+ This is a similar tool to RSeQC but incorporates more quality control metrics.

Read Mappers

Bowtie/Tophat/Tophat2

• The first widely used mapping tool. + Detects splice variants. − Currently much slower than most other mappers and requires a relatively large amount of memory.

STAR

• A widely used tool to align reads to a genome. + Maps ∼50 times faster than Tophat and Tophat2. + Commonly used tool to detect novel splice variants. − Uses a large amount of memory (>20 GB for mapping to the human genome).

HISAT

• A widely used tool to align reads to a genome at a faster rate than STAR with comparable accuracy. + HISAT2 is expected to be the core of the next version of Tophat (Tophat3). + Detects novel splice variants. + The newer HISAT2 version aligns to genotype variants, likely achieving higher accuracy. + Uses less memory than STAR (<8 GB for mapping to the human genome using default settings).

BWA

• A commonly used aligner for species in which splicing does not occur. − Does not detect splice variants.

Kallisto

• A tool that uses a pseudoalignment strategy to assign expression values to transcripts/genes to achieve optimal speed. • Comparable accuracy to other tools using real alignment strategies. • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). + Uses little memory and can be run on a regular desktop computer. − Does not identify novel splice variants.

Salmon

• Another pseudoalignment tool. Performance comparable with Kallisto. • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). − Does not identify novel splice variants.

Read counting tools

FeatureCounts

+ A tool that is similar to HTseq but much faster. Results are slightly different owing to slightly different expression assignment strategies.

SpliceNet

• A tool that divides the reads mapping to an exon shared with two isoforms proportionally to the total expression of each of the two whole isoforms. + Estimates expression more accurately when multiple genes/transcripts partly share the same genome regions.
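
The proportional division described for SpliceNet can be illustrated in a few lines. This is only a sketch of the stated idea, not SpliceNet's implementation. The function name is hypothetical, and the even split when both isoforms have zero expression is an assumption.

```python
def split_shared_reads(shared_reads: float,
                       expr_a: float,
                       expr_b: float) -> tuple[float, float]:
    """Divide reads on an exon shared by two isoforms.

    Each isoform receives a share proportional to its total expression.
    """
    total = expr_a + expr_b
    if total == 0:
        # Assumption for this sketch: no information, so split evenly.
        return shared_reads / 2, shared_reads / 2
    return shared_reads * expr_a / total, shared_reads * expr_b / total
```

For example, 100 shared reads with isoform expression 30 and 10 are divided 75 to 25.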

Normalization

FPKM/RPKM

• Widely used normalization methods that correct for the total number of reads in a sample while accounting for gene length. − TMM has been suggested as a better alternative.
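
As a rough sketch of the calculation, FPKM divides each gene's count by the gene length in kilobases and by the sample's total read count in millions. The gene names and numbers in the usage note are made up, and the function name is hypothetical.

```python
def fpkm(counts: dict[str, float], lengths_bp: dict[str, float]) -> dict[str, float]:
    """FPKM for one sample.

    counts     : fragments (or reads, for RPKM) assigned to each gene
    lengths_bp : gene length in base pairs
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}
```

With counts `{"A": 100, "B": 300, "C": 600}` and lengths of 1000, 2000 and 3000 bp, gene A has 100 fragments per kb in a library of 0.001 million fragments, so its FPKM is 100 000.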

TPM

• A method similar to FPKM, but normalizes the total expression to 1 million, i.e. the summed expression of TPM-normalized samples is always 1 million.
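
A minimal sketch of TPM, assuming per-gene counts and gene lengths are available: counts are first divided by gene length, and the resulting rates are then rescaled so that they sum to one million. The function name and the example values are hypothetical.

```python
def tpm(counts: dict[str, float], lengths_bp: dict[str, float]) -> dict[str, float]:
    """TPM for one sample: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    if scale == 0:
        raise ValueError("sample has no mapped reads")
    return {g: r / scale * 1e6 for g, r in rate.items()}
```

Because the rescaling happens last, the values of every sample add up to the same total, which is the property described above.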

TMM

• Similar to FPKM/RPKM but puts expression measures on a common scale across different samples.

RAIDA

• A method that uses ratios between counts of genes in each sample for normalizations. + Avoids problems caused by differential transcript abundance between samples (resulting from differential expression of highly abundant gene transcripts).

DEseq2

• A normalization method that adjusts the expression values of each gene in a sample by a set factor. This factor is determined by taking the median gene expression in a sample after dividing the expression of each gene by the geometric mean of the given gene across all samples. This differs from the normalization implemented in the DEseq2 differential expression analysis. • Implemented into the DEseq2 R package.
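
The median-of-ratios factor described above can be sketched as follows. This follows the description in the table rather than the R package's source code. Skipping genes that have a zero count in any sample is an assumption made here so that the geometric mean is defined, and the function name is hypothetical.

```python
import math
from statistics import median


def size_factors(counts: dict[str, list[float]]) -> list[float]:
    """One scaling factor per sample.

    counts maps each gene to its list of counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios: list[list[float]] = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue  # geometric mean would be zero; skip this gene
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Median ratio of each sample to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing each sample's counts by its factor puts the samples on a common scale. In the test data, sample 2 has exactly twice the counts of sample 1, so its factor is twice as large.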

Correction for batch effects

Limma-removeBatchEffect

• A method which uses linear models to correct for batch effects.
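
For the simplest case, a single batch factor and no other covariates, the linear-model idea reduces to shifting each batch so that its mean matches a common reference. The sketch below illustrates that reduced case for one gene. It is not limma's code, the function name is hypothetical, and using the unweighted mean of the batch means as the reference is an assumption of this sketch.

```python
from collections import defaultdict


def center_batches(values: list[float], batches: list[str]) -> list[float]:
    """Remove a per-batch shift from one gene's (log-scale) expression."""
    by_batch: dict[str, list[float]] = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    reference = sum(batch_mean.values()) / len(batch_mean)
    # Subtract each batch's deviation from the reference level.
    return [v - (batch_mean[b] - reference) for v, b in zip(values, batches)]
```

If the batches are confounded with the biological groups, this kind of correction also removes the biological signal, so the experimental design matters more than the method.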

Svaseq

• This method estimates biases based on genes that have no phenotypic expression effects, which are then used for correction of the data. • Specifically designed for RNA-seq data.

Combat

• A method that is robust to outliers and also effective at batch effect correction in small sample sizes (<25).

Co-expression module detection

WGCNA

• A tool that constructs a co-expression network using Pearson correlation (default) or a custom distance measure. • Uses hierarchical clustering and has various ‘tree cutting’ options to identify modules. + Most widely used tool, well supported and documented.
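
The first step, building a weighted network from Pearson correlations, can be sketched in plain Python. The absolute correlation is raised to a power (soft thresholding) so that weak correlations shrink toward zero. The power of 6 is a commonly used value and is treated here as an assumption, the function names are hypothetical, and the hierarchical clustering and tree-cutting steps are left out.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def adjacency(expr: dict[str, list[float]], beta: float = 6.0) -> dict[tuple[str, str], float]:
    """Unsigned soft-thresholded adjacency |cor|^beta for every gene pair."""
    genes = list(expr)
    adj = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            adj[(g, h)] = abs(pearson(expr[g], expr[h])) ** beta
    return adj
```

Because the network is unsigned, strongly anti-correlated genes receive the same weight as strongly correlated ones.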

DiffCoEx

• A method that uses a similar approach to WGCNA to identify and group differentially co-expressed genes instead of identifying co-expressed modules.

DICER

• A method that identifies modules that correlate differently between sample groups, e.g. modules that form one large interconnected module in one group compared with several smaller modules in another group.

CoXpress

• A tool that identifies co-expression modules in each sample group and tests whether the genes within these modules are also co-expressed in other groups.

DINGO

• DINGO is a more recent tool that groups genes based on how differently they behave in a particular subset of samples (representing e.g. a particular condition) from the baseline co-expression determined from all samples.

GSCNA

• A tool that tests whether a predefined gene set is differentially expressed between two sample groups.

GSVD

• A method that identifies ‘genelets’, which can be interpreted as modules representing partial co-expression signals from multiple genes. These signals are then compared between two groups to identify genelets unique to samples and genelets that are shared between the two groups.

HO-GSVD

• A tool similar to GSVD, but that can be used across multiple sample groups rather than only two.

Biclustering

• A group of methods that identify modules that are unique to a subpopulation of samples without the need for prior grouping of samples.

Functional enrichment

PANTHER

• A tool that uses a comprehensive protein library combined with human curated pathways and evolutionary ontology. • If a gene is not in the library, it is classified based on its protein sequence conservation and by finding a related gene.

DAVID

• A widely used tool with an online web interface. Users supply a list of genes and select the annotation categories from various sources to identify enrichment.

g:Profiler

• A tool that performs enrichment analyses for gene ontologies, KEGG pathways, protein–protein interactions, TF and miRNA binding sites. + Also available as an R package.

ClusterProfiler

• An R package for overrepresentation and gene set enrichment analyses for several curated gene sets. + Allows users to compare the results of analyses performed on several gene sets.

Enrichr

• An intuitive web tool for performing gene overrepresentation analyses using a comprehensive set of functional annotations.
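
Overrepresentation analysis is commonly framed as a one-sided hypergeometric test: given a background of genes, how surprising is the overlap between the gene list and an annotation category? The sketch below shows that test only. Whether a given tool in this section uses exactly this statistic is not stated here, and the function name is hypothetical.

```python
from math import comb


def overrepresentation_p(n_universe: int, n_annotated: int,
                         n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe  : genes in the background set
    n_annotated : background genes carrying the annotation
    n_selected  : genes in the list being tested
    n_overlap   : listed genes that carry the annotation
    """
    max_k = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total
```

In practice the p-values of all tested categories also need a multiple-testing correction, which is omitted here.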

ToppGene

• An intuitive tool that determines enrichment of different categories such as GO terms, chromosomal locations and disease associations. • Includes enrichment for TFBS and miRNA. + Also has other functions, such as candidate gene prioritization, based on network structures.

Regulatory network inference

ARACNE

• A tool that removes indirect connections between genes (i.e. partners of a gene that have a stronger correlation with each other than with the gene itself), leaving only those connections that are expected to be regulatory. + Creates directional networks.
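
The pruning of indirect connections can be sketched as follows: in every fully connected triplet of genes, the weakest of the three edges is treated as indirect and dropped. This is a simplified illustration and not ARACNE's implementation. It takes precomputed edge weights as input, ignores any tolerance parameter, and the function name is hypothetical.

```python
from itertools import combinations


def prune_indirect(weights: dict[frozenset, float]) -> dict[frozenset, float]:
    """Drop the weakest edge of every fully connected gene triplet.

    weights maps frozenset({gene1, gene2}) to an association score.
    """
    genes = sorted({g for pair in weights for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in weights for e in edges):
            # Triplets are judged on the original network, so edges are
            # only marked here and removed at the end.
            to_remove.add(min(edges, key=lambda e: weights[e]))
    return {e: w for e, w in weights.items() if e not in to_remove}
```

If A-B scores 0.9, B-C scores 0.8 and A-C scores 0.3, the A-C edge is removed because it is explained by the path through B.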

Genie3

• A tool that incorporates TF information to construct a regulatory network by determining the TF expression pattern that best explains the expression of each of their target genes. + Creates directional networks. − Requires TF information.

CoRegNet

• A tool that identifies co-operative regulators of genes from different data types.

cMonkey

• Calculates joint bicluster membership probability from different data types by identifying groups of genes that group together in multiple data types.

Visualization

Cytoscape

• A widely used tool for the visualization of networks. + Has many plug-ins available for specific analyses.

BioLayout

• Similar to Cytoscape but less widely used. + Can load and visualize much larger networks than Cytoscape.

Co-expression databases

COXPRESdb

• A web resource incorporating 12 co-expression networks for different species created from ∼157 000 microarrays and 10 000 RNA-seq samples. Has a focus on protein-coding RNAs.

GeneFriends

• Human and mouse gene and transcript co-expression networks. • Networks constructed from ∼4000 RNA-seq samples each. + Includes a number of non-coding RNAs (∼10 000 for mouse and ∼25 000 for human).

GeneMANIA

• Also includes physical and genetic interaction, co-localization, pathway and shared protein domain information data sets. + Networks for nine species.

GENEVESTIGATOR

• A database constructed using ∼145 000 samples. + Curated database. + Networks for 18 species. + Multiple data types.

GIANT

• Tissue-specific interaction network database. • Includes 987 data sets encompassing 38 000 conditions describing 144 tissue types. + Integrates physical interaction, co-expression, miRNA binding motif and TF binding site data.
