The results on this page are based on 31 GTEx datasets. We preprocessed each dataset with lenient filters to keep data for as many genes and samples as possible, then compared methods commonly used in RNA-seq analysis by constructing one coexpression network per dataset (i.e. building 31 GTEx networks).
Key stages of data processing and network building tested here:
1. Within-sample normalization: counts per million (CPM), transcripts per million (TPM), and reads per kilobase per million (RPKM),
2. Between-sample normalization: quantile (QNT), trimmed mean of M-values (TMM), and upper quartile (UQ), and
3. Network transformation: weighted topological overlap (WTO) and context likelihood of relatedness (CLR).
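As a rough illustration of the first two stages, the sketch below implements CPM, RPKM, TPM, and quantile normalization with NumPy. It assumes a genes-by-samples count matrix and gene lengths in kilobases. The function names and toy values are hypothetical and are not taken from the analysis code behind this page.

```python
import numpy as np

def cpm(counts):
    """Counts per million: scale each sample (column) so it sums to 1e6."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def rpkm(counts, lengths_kb):
    """Reads per kilobase per million: CPM divided by gene length in kb."""
    return cpm(counts) / np.asarray(lengths_kb, dtype=float)[:, None]

def tpm(counts, lengths_kb):
    """Transcripts per million: divide by length first, then scale to 1e6."""
    lengths_kb = np.asarray(lengths_kb, dtype=float)[:, None]
    rate = np.asarray(counts, dtype=float) / lengths_kb
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def quantile_normalize(values):
    """Give every sample the same distribution (ties are broken arbitrarily)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, axis=0), axis=0)
    reference = np.sort(values, axis=0).mean(axis=1)  # mean of each quantile
    return reference[ranks]

# Toy data (made up): 3 genes x 2 samples, gene lengths in kb.
toy_counts = np.array([[10, 50], [30, 20], [60, 130]])
toy_lengths_kb = np.array([1.0, 2.0, 3.0])
toy_tpm = tpm(toy_counts, toy_lengths_kb)
toy_qnt = quantile_normalize(cpm(toy_counts))
```

The only difference between RPKM and TPM here is the order of operations: TPM divides by gene length before scaling to one million, so every TPM sample sums to the same total, which is not true of RPKM.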
Individual methods are in regular font (e.g. TPM normalization) and workflows are italicized (e.g. TPM, which is TPM combined with no between-sample normalization and no network transformation, or TPM_CLR, which is TPM paired with just CLR).
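To make the naming concrete, the sketch below shows how a workflow such as TPM_CLR might be composed: a TPM-normalized matrix goes straight to a gene-gene similarity matrix, which is then transformed with CLR. Pearson correlation as the coexpression measure, the exclusion of the diagonal from each gene's background statistics, and all function names are assumptions made for illustration. This page does not specify them.

```python
import numpy as np

def clr(similarity):
    """Context likelihood of relatedness on a symmetric gene-gene matrix."""
    similarity = np.asarray(similarity, dtype=float)
    # Assumption: a gene's self-similarity is left out of its background.
    off_diagonal = ~np.eye(similarity.shape[0], dtype=bool)
    background = np.where(off_diagonal, similarity, np.nan)
    mean = np.nanmean(background, axis=1)
    std = np.nanstd(background, axis=1)
    # z[i, j]: how unusual pair (i, j) is against gene i's background;
    # negative z-scores are set to zero.
    z = np.clip((similarity - mean[:, None]) / std[:, None], 0.0, None)
    network = np.sqrt(z ** 2 + z.T ** 2)
    np.fill_diagonal(network, 0.0)
    return network

def tpm_clr(tpm_matrix):
    """Hypothetical TPM_CLR workflow: no between-sample step, then CLR."""
    coexpression = np.corrcoef(tpm_matrix)  # assumption: Pearson, genes as rows
    return clr(coexpression)

# Toy input (random, for illustration only): 6 genes x 10 samples.
rng = np.random.default_rng(0)
toy_tpm = rng.random((6, 10))
toy_tpm = toy_tpm / toy_tpm.sum(axis=0) * 1e6
toy_network = tpm_clr(toy_tpm)
```

Under the same assumptions, the plain TPM workflow would simply return `coexpression` without the CLR step.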
The evaluation metrics are:
Log2(auPRC/prior); auPRC is the area under the Precision-Recall Curve, prior is the expected auPRC from a random classifier. Focus on this metric to pick the workflow that results in an overall high-quality coexpression network.
auROC; area under the ROC Curve. This commonly used metric also points to the overall quality of the network but is not well-suited for imbalanced problems such as predicting gene functional relationships. We include this metric here for completeness.
Log2(p20r/prior); p20r is precision at 20% recall. Focus on this metric if you are interested in applying a threshold to the coexpression network and selecting high-scoring gene pairs for some subsequent analysis.
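The two prior-normalized metrics above can be sketched as follows. Gene pairs are ranked by edge score, the prior is the fraction of positive pairs (the precision a random ranking is expected to achieve), and a value above 0 means better than random. Estimating auPRC by average precision and reading p20r at the first rank where recall reaches 20% are assumptions of this sketch, as are the function names. The page's own evaluation code may differ.

```python
import numpy as np

def ranked_precision_recall(labels, scores):
    """Precision and recall after each gene pair, ranked by descending score."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, hits.size + 1)
    recall = true_positives / hits.sum()
    return precision, recall, hits

def log2_auprc_over_prior(labels, scores):
    precision, _, hits = ranked_precision_recall(labels, scores)
    auprc = precision[hits].mean()  # average precision (step-wise area)
    prior = hits.mean()             # fraction of positive gene pairs
    return float(np.log2(auprc / prior))

def log2_p20r_over_prior(labels, scores):
    precision, recall, hits = ranked_precision_recall(labels, scores)
    first = np.searchsorted(recall, 0.2)  # first rank with recall >= 20%
    return float(np.log2(precision[first] / hits.mean()))

# Toy example (made up): 8 gene pairs, 2 of them true positives.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_auprc_score = log2_auprc_over_prior(toy_labels, toy_scores)
toy_p20r_score = log2_p20r_over_prior(toy_labels, toy_scores)
```

Dividing by the prior before taking the log makes scores comparable across gold standards with different proportions of positive pairs.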
The evaluations are based on:
Naive standard: Experimentally-verified gene co-annotations to specific Biological Process terms in the Gene Ontology, and
Tissue-aware standard: The naive gold standard subset by genes known to be expressed in a given tissue.
Sample size refers to the number of samples used to create a network. Each barplot shows the Spearman correlation of sample size and performance for each workflow.
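For one workflow, the value behind each bar could be computed along these lines: collect the sample size and the performance of every network, then take the Spearman correlation as the Pearson correlation of the ranks. The numbers below are invented for illustration and the helper names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Made-up values: one entry per network built with a single workflow.
toy_sample_sizes = [80, 150, 210, 400, 650]
toy_performance = [0.9, 1.4, 1.2, 1.8, 2.1]
toy_rho = spearman(toy_sample_sizes, toy_performance)
```

A value near 1 would indicate that the workflow tends to perform better on datasets with more samples, and a value near 0 that its performance is largely insensitive to sample size.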
Sample similarity is determined by subsetting all samples to the 50% most variable genes in the tissue they came from, then calculating the Spearman correlation between all sample pairs in the experiment and taking the median value. Each barplot shows the Spearman correlation of sample similarity and performance for each workflow.
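The sample similarity calculation described above can be sketched like this, assuming a genes-by-samples expression matrix for a single tissue. Using variance to define the "most variable" genes and breaking rank ties arbitrarily are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

def sample_similarity(expression):
    """Median pairwise Spearman correlation over the top 50% variable genes."""
    expression = np.asarray(expression, dtype=float)
    n_keep = max(1, expression.shape[0] // 2)
    # Assumption: "most variable" means highest variance across samples.
    top_genes = np.argsort(expression.var(axis=1))[::-1][:n_keep]
    subset = expression[top_genes]
    # Rank genes within each sample (ties broken arbitrarily), then take
    # Pearson correlation of the ranks, i.e. Spearman between samples.
    ranks = np.argsort(np.argsort(subset, axis=0), axis=0)
    correlation = np.corrcoef(ranks, rowvar=False)
    pairs = correlation[np.triu_indices_from(correlation, k=1)]
    return float(np.median(pairs))

# Toy input (random, for illustration only): 20 genes x 5 samples.
rng = np.random.default_rng(0)
toy_expression = rng.random((20, 5))
toy_similarity = sample_similarity(toy_expression)
```

Only the upper triangle of the correlation matrix is used, so each sample pair is counted once and self-correlations are excluded from the median.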
Read count diversity is calculated for a given experiment by taking the standard deviation of the sum of counts in each sample. Each barplot shows the Spearman correlation of read count diversity and performance for each workflow.
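In code, read count diversity reduces to a single expression. The sketch below assumes a genes-by-samples count matrix and uses the population standard deviation (NumPy's default, `ddof=0`). The text does not say whether the population or sample form is used, so treat that choice, and the function name, as assumptions.

```python
import numpy as np

def read_count_diversity(counts):
    """Standard deviation of the per-sample total counts (library sizes)."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)  # sum of counts in each sample
    return float(library_sizes.std())   # assumption: population std, ddof=0

# Toy counts (made up): 2 genes x 3 samples with library sizes 2, 4, and 6.
toy_counts = np.array([[1, 2, 3], [1, 2, 3]])
toy_diversity = read_count_diversity(toy_counts)
```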
Each point in the interactive scatterplots represents the performance (y-axis) of one workflow (color of the point) in one tissue (x-axis).