Description, strengths (+) and limitations (−)
• A tool that uses .fastq, .bam or .sam files to identify and highlight potential issues in the data, such as low base quality scores, low sequence quality and GC content biases. + Can be used either with or without a user interface. − Uses only the first 200 000 sequences in the file.
+ A tool with a wider range of quality control measures than FastQC. + Can also be used on mapped data to obtain information on metrics such as the prevalence of splicing events.
+ This is a similar tool to RSeQC but incorporates more quality control metrics.
• The first widely used mapping tool. + Detects splice variants. − Currently much slower than most other mappers and requires a relatively large amount of memory.
• A widely used tool to align reads to a genome. + Maps ∼50 times faster than Tophat and Tophat2. + Commonly used tool to detect novel splice variants. − Uses a large amount of memory (>20 GB for mapping to the human genome).
• A widely used tool to align reads to a genome at a faster rate than STAR with comparable accuracy. + HISAT2 is expected to be the core of the next version of Tophat (Tophat3). + Detects novel splice variants. + The newer HISAT2 version aligns to genotype variants, likely achieving higher accuracy. + Uses less memory than STAR (<8 GB for mapping to the human genome using default settings).
• A commonly used aligner for species in which splicing does not occur. − Does not detect splice variants.
• A tool that uses a pseudoalignment strategy to assign expression values to transcripts/genes to achieve optimal speed. • Comparable accuracy to other tools using real alignment strategies. • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). + Uses little memory and can be run on a regular desktop computer. − Does not identify novel splice variants.
• Another pseudoalignment tool. Performance comparable with Kallisto. • Reports reads/expression per gene instead of read alignment coordinates (which are commonly used to acquire the expression per gene). − Does not identify novel splice variants.
Read counting tools
+ A tool that is similar to HTseq but much faster. Results are slightly different owing to slightly different expression assignment strategies.
• A tool that divides the reads mapping to an exon shared by two isoforms proportionally to the total expression of each of the two whole isoforms. + Estimates expression more accurately when multiple genes/transcripts partly share the same genome regions.
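The proportional division described above can be illustrated with a small sketch. It assumes that an expression estimate for each whole isoform is already available; the function name `split_shared_reads` and the numbers are hypothetical, and real tools estimate isoform expression and read assignment jointly rather than in one pass.

```python
def split_shared_reads(shared_reads, isoform_expression):
    """Divide reads from a shared exon among isoforms.

    shared_reads: number of reads mapping to the shared exon.
    isoform_expression: dict of isoform name -> expression estimate
    for the whole isoform (assumed to be known already).
    """
    total = sum(isoform_expression.values())
    if total == 0:
        # No information to weight by: split evenly (an assumption of this sketch).
        share = shared_reads / len(isoform_expression)
        return {name: share for name in isoform_expression}
    # Each isoform receives a fraction equal to its share of total expression.
    return {
        name: shared_reads * expr / total
        for name, expr in isoform_expression.items()
    }


# Toy example: isoform_A is three times as abundant as isoform_B.
allocation = split_shared_reads(100, {"isoform_A": 30.0, "isoform_B": 10.0})
print(allocation)
```

In this toy case the more abundant isoform receives three quarters of the shared reads.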
• Widely used normalization methods that correct for the total number of reads in a sample while accounting for gene length. − TMM has been suggested as a better alternative.
• A method similar to FPKM, but normalizes the total expression to 1 million, i.e. the summed expression of TPM-normalized samples is always 1 million.
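The fixed total of 1 million follows from the order of operations: counts are first divided by gene length, and the resulting rates are then rescaled within the sample. A minimal sketch, assuming per-gene read counts and gene lengths in kilobases; the function name `tpm` and the toy values are illustrative only.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    counts: dict of gene -> read count.
    lengths_kb: dict of gene -> gene (or transcript) length in kilobases.
    """
    # Step 1: length-normalize, giving reads per kilobase for each gene.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("sample has no reads; TPM is undefined")
    # Step 2: rescale so that the values in this sample sum to 1 million.
    return {gene: rate * 1_000_000 / scale for gene, rate in rates.items()}


# Toy sample with three genes (hypothetical numbers).
sample_tpm = tpm(
    counts={"g1": 100, "g2": 300, "g3": 600},
    lengths_kb={"g1": 1.0, "g2": 3.0, "g3": 2.0},
)
print(sample_tpm)
print(sum(sample_tpm.values()))
```

Here g1 and g2 end up with the same TPM value even though g2 has three times as many reads, because g2 is also three times as long.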
• Similar to FPKM/RPKM but puts expression measures on a common scale across different samples.
• A method that uses ratios between counts of genes in each sample for normalizations. + Avoids problems caused by differential transcript abundance between samples (resulting from differential expression of highly abundant gene transcripts).
• A normalization method that adjusts the expression values of each gene in a sample by a set factor. This factor is determined by taking the median gene expression in a sample after dividing the expression of each gene by the geometric mean of the given gene across all samples. This differs from the normalization implemented in the DEseq2 differential expression analysis. • Implemented in the DEseq2 R package.
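The per-sample factor described above can be sketched in a few lines. This is an illustration of the description only, not the code of any package. It assumes a small count table held as a dict of lists, and it skips genes with a zero count in any sample (their geometric mean would be zero), which is a choice made for this sketch. The function name `size_factors` is hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios factor per sample.

    counts: dict of sample -> list of per-gene counts, same gene order
    in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; None marks genes
    # that are skipped because at least one sample has a zero count.
    log_geo_mean = []
    for i in range(n_genes):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_mean.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_mean.append(None)

    if all(lg is None for lg in log_geo_mean):
        raise ValueError("no gene has non-zero counts in every sample")

    factors = {}
    for s in samples:
        # Ratio of each gene to its geometric mean, on the log scale.
        log_ratios = [
            math.log(counts[s][i]) - lg
            for i, lg in enumerate(log_geo_mean)
            if lg is not None
        ]
        # The factor for the sample is the median ratio.
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy table: sample_b was sequenced roughly twice as deeply as sample_a.
raw = {"sample_a": [10, 20, 30, 0], "sample_b": [20, 40, 60, 5]}
factors = size_factors(raw)
normalized = {s: [c / factors[s] for c in raw[s]] for s in raw}
print(factors)
```

Dividing each sample by its factor puts the first three genes on the same scale in both samples.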
Correction for batch effects
• A method which uses linear models to correct for batch effects.
• This method estimates biases based on genes that have no phenotypic expression effects, which are then used for correction of the data. • Specifically designed for RNA-seq data.
• A method that is robust to outliers and also effective at batch effect correction in small sample sizes (<25).
Co-expression module detection
• A tool that constructs a co-expression network using Pearson correlation (default) or a custom distance measure. • Uses hierarchical clustering and has various ‘tree cutting’ options to identify modules. + Most widely used tool, well supported and documented.
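The general idea of correlation followed by hierarchical clustering and tree cutting can be sketched as follows. This is a deliberately simplified illustration and not the tool's implementation: it uses 1 − |Pearson r| as the distance, average linkage, and a single fixed cut height, all of which are assumptions of the sketch. The function names and the toy expression profiles are hypothetical, and every profile is assumed to be non-constant.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_modules(expression, cut_height=0.5):
    """Group genes by average-linkage clustering of 1 - |r| distances.

    expression: dict of gene -> list of expression values across samples.
    cut_height: merging stops once the closest pair of clusters is
    further apart than this, which acts as a fixed-height tree cut.
    """
    genes = list(expression)
    dist = {
        (a, b): 1 - abs(pearson(expression[a], expression[b]))
        for a in genes for b in genes if a != b
    }

    def linkage(c1, c2):
        # Average pairwise distance between two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[g] for g in genes]
    while len(clusters) > 1:
        height, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if height > cut_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy data: g1/g2 rise together, g3/g4 share a different pattern.
profiles = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10.5],
    "g3": [5, 1, 4, 2, 3],
    "g4": [10, 2, 8, 4, 6.5],
}
modules = coexpression_modules(profiles, cut_height=0.5)
print(modules)
```

With these profiles the two pairs are merged at very small heights, while the distance between the pairs is well above the cut, so two modules remain.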
• A method that uses a similar approach to WGCNA to identify and group differentially co-expressed genes instead of identifying co-expressed modules.
• A method that identifies modules that correlate differently between sample groups, e.g. modules that form one large interconnected module in one group compared with several smaller modules in another group.
• A tool that identifies co-expression modules in each sample group and tests whether the genes within these modules are also co-expressed in other groups.
• DINGO is a more recent tool that groups genes based on how differently they behave in a particular subset of samples (representing e.g. a particular condition) from the baseline co-expression determined from all samples.
• A tool that tests whether a predefined gene set is differentially expressed between two sample groups.
• A method that identifies ‘genelets’, which can be interpreted as modules representing partial co-expression signals from multiple genes. These signals are then compared between two groups to identify genelets unique to samples and genelets that are shared between the two groups.
• A tool similar to GSVD, but that can be used across multiple sample groups rather than only two.
• A group of methods that identify modules that are unique to a subpopulation of samples without the need for prior grouping of samples.
• A tool that uses a comprehensive protein library combined with human-curated pathways and evolutionary ontology. • If a gene is not in the library, it is classified based on its protein sequence conservation and by finding a related gene.
• A widely used tool with an online web interface. Users supply a list of genes and select the annotation categories from various sources to identify enrichment.
• A tool that performs enrichment analyses for gene ontologies, KEGG pathways, protein–protein interactions, TF and miRNA binding sites. + Also available as an R package.
• An R package for overrepresentation and gene set enrichment analyses for several curated gene sets. + Allows users to compare the results of analyses performed on several gene sets.
• An intuitive web tool for performing gene overrepresentation analyses using a comprehensive set of functional annotations.
• An intuitive tool that determines enrichment of different categories such as GO terms, chromosomal locations and disease associations. • Also determines enrichment for TFBS and miRNA. + Also has other functions, such as candidate gene prioritization, based on network structures.
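Overrepresentation analyses of the kind offered by several of the tools above are commonly based on a hypergeometric (one-sided Fisher) test. The sketch below shows that general formulation only. It is not a description of how any listed tool computes its statistics, and the function name and counts are hypothetical.

```python
from math import comb


def overrepresentation_p(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value.

    n_background: genes in the background set.
    n_annotated: background genes carrying the annotation (e.g. a GO term).
    n_selected: genes in the list of interest (e.g. one module).
    n_overlap: selected genes that carry the annotation.
    Returns P(overlap >= n_overlap) if the list were drawn at random.
    """
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_selected)


# Toy example: 5 of 20 background genes are annotated; a 4-gene module
# contains 3 of them.
p_value = overrepresentation_p(
    n_background=20, n_annotated=5, n_selected=4, n_overlap=3
)
print(p_value)
```

In practice one such test is run per annotation category, so the resulting p-values need a multiple-testing correction.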
Regulatory network inference
• A tool that removes indirect connections between genes (i.e. partners of a gene that have a stronger correlation with each other than with the gene itself), leaving only those connections that are expected to be regulatory. + Creates directional networks.
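One common way to remove indirect connections is to inspect every fully connected triplet of genes and drop its weakest edge, on the reasoning that the weakest link is most likely explained by the other two. The sketch below illustrates that pruning rule on precomputed pairwise scores. It is an assumption that this matches the tool described above, it does not assign edge directions, and the function name, gene names and scores are hypothetical.

```python
from itertools import combinations


def prune_indirect_edges(similarity):
    """Drop the weakest edge of every fully connected gene triplet.

    similarity: dict of frozenset({gene_a, gene_b}) -> association score
    (for example a correlation or mutual information estimate).
    """
    nodes = sorted({gene for edge in similarity for gene in edge})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(edge in similarity for edge in triangle):
            # Decisions use the original scores, so the result does not
            # depend on the order in which triplets are visited.
            # On ties, the first edge in the list is removed.
            weakest = min(triangle, key=lambda edge: similarity[edge])
            to_remove.add(weakest)
    return {e: s for e, s in similarity.items() if e not in to_remove}


# Toy network: a regulator linked to two targets that also correlate
# with each other, but more weakly.
scores = {
    frozenset(("TF", "A")): 0.9,
    frozenset(("TF", "B")): 0.8,
    frozenset(("A", "B")): 0.5,
}
pruned = prune_indirect_edges(scores)
print(sorted(tuple(sorted(edge)) for edge in pruned))
```

In the toy network the A–B edge is removed, leaving only the two connections to the regulator.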
• A tool that incorporates TF information to construct a regulatory network by determining the TF expression pattern that best explains the expression of each of their target genes. + Creates directional networks. − Requires TF information.
• A tool that identifies co-operative regulators of genes from different data types.
• Calculates joint bicluster membership probability from different data types by identifying groups of genes that group together in multiple data types.
• A widely used tool for the visualization of networks. + Has many plug-ins available for specific analyses.
• Similar to Cytoscape but less widely used. + Can load and visualize much larger networks than Cytoscape.
• A web resource incorporating 12 co-expression networks for different species created from ∼157 000 microarrays and 10 000 RNA-seq samples. Has a focus on protein-coding RNAs.
• Human and mouse gene and transcript co-expression networks. • Networks constructed from ∼4000 RNA-seq samples each. + Includes a number of non-coding RNAs (∼10 000 for mouse and ∼25 000 for human).
• Also includes physical and genetic interaction, co-localization, pathway and shared protein domain information data sets. + Networks for nine species.
• A database constructed using ∼145 000 samples. + Curated database. + Networks for 18 species. + Multiple data types.
• Tissue-specific interaction network database. • Includes 987 datasets encompassing 38 000 conditions describing 144 tissue types. + Integrates physical interaction, co-expression, miRNA binding motif and TF binding site data.