In
"The impact of rare variation on gene expression across tissues" Nature 550, 239–243 (12 October 2017) doi:10.1038/nature24267
As far as I understand, using the GTEx data, the authors have written a predictive algorithm (RIVER) to predict the consequences of a set of variants on gene expression.
Can you please explain Figure 5c, "Performance of RIVER for prioritizing functional regulatory variants"?
http://www.nature.com/nature/journal/v550/n7675/full/nature24267.html#f5
Distribution of RIVER scores (shades of blue) as a function of expression and genomic annotation scores. The distributions of variant categories across expression and genomic annotation scores are shown as histograms aligned opposite the corresponding axes.
I don't understand how I should read that figure. What is the Y axis? What are the red/orange circles in the figure?
I am still trying to understand, but the bioRxiv version has a better legend for the same figure: "Distribution of RIVER scores (shades of blue) as a function of scores from genomic annotation or gene expression alone. Pathogenic SNVs annotated in ClinVar are shown in red if they were likely regulatory (nonsense, splice-site, or synonymous) and orange otherwise (missense). The distributions of variant categories across absolute median Z-scores and predictions from genomic annotation are shown as histograms aligned opposite the corresponding axes"
https://www.biorxiv.org/content/biorxiv/early/2016/09/09/074443.full.pdf
Pathogenic SNVs annotated in ClinVar are shown in:
@aditi.qamra @cpad0112 thanks for the colored dots! :-) I still don't get the whole figure itself. How should I read it? Why is it interesting?
Here's my quick attempt, again from the bioRxiv version: Although RIVER was trained in an unsupervised manner, the learned model prioritized variants that were supported by both extreme expression levels for a nearby gene and genomic annotations suggestive of potential impact (Fig. 5c). Rather than using a heuristic or manual approach, RIVER automatically learns the relationship between genomic annotations and changes in gene expression from data to provide a coherent estimate of the probability of regulatory impact.
So variants with both a more extreme expression level and a higher RIVER (G only) score will be prioritised. According to their code, outliers are called as those with an absolute median Z-score >= 2 (line 96 of https://github.com/joed3/GTExV6PRareVariation/blob/master/call_outliers/call_outliers_medz.R), so you start seeing more blue around that value (?). And no, it's not interesting or clear.
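To make the outlier rule above concrete, here is a minimal Python sketch of a median Z-score call. It is not the authors' R code: the function name `call_outliers`, the input layout, and the example values are all made up for illustration, and I am assuming the threshold applies to the absolute value of the median across tissues.

```python
from statistics import median


def call_outliers(z_by_tissue, threshold=2.0):
    """Flag (individual, gene) pairs whose median Z-score across tissues is extreme.

    z_by_tissue: dict mapping (individual, gene) -> list of per-tissue Z-scores.
    Returns a dict mapping the same keys -> (median Z-score, is_outlier).
    The data layout and the default threshold are assumptions for this sketch.
    """
    calls = {}
    for key, z_scores in z_by_tissue.items():
        med = median(z_scores)
        # Outlier if the median is extreme in either direction (over- or under-expression).
        calls[key] = (med, abs(med) >= threshold)
    return calls


# Hypothetical Z-scores for one gene in three individuals, five tissues each.
example = {
    ("ind1", "GENE_A"): [2.5, 3.1, 2.2, 1.9, 2.8],      # consistently high
    ("ind2", "GENE_A"): [0.3, -0.4, 2.6, 0.1, -0.2],    # extreme in one tissue only
    ("ind3", "GENE_A"): [-2.4, -3.0, -2.1, -2.2, -1.5],  # consistently low
}

calls = call_outliers(example)
for key, (med, is_outlier) in calls.items():
    print(key, med, is_outlier)
```

Using the median means a single extreme tissue (like `ind2` here) is not enough; the expression has to be shifted in most tissues for the individual to be called an outlier.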
If you really want to get in deep here's the code for the figure - https://github.com/joed3/GTExV6PRareVariation/blob/master/paper_figures/figure5c.R :)
P.S. Please correct me if I'm wrong - which I very well might be :)
To my understanding, the authors are trying to show that the RIVER model (which integrates genomic and transcriptomic information) is better at predicting the outcome than models that use only genomic annotations.
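A toy way to see why integrating the two sources matters, and why the score is highest when both the annotation score and the expression signal are high: treat the genomic-annotation score as a prior probability that the variant is functional, then update it with the expression evidence using Bayes' rule. This is only a sketch of the general idea, not RIVER itself. The function name `river_like_posterior` and the two likelihood values (0.6 and 0.05) are hypothetical numbers I picked for illustration, and the real model learns its parameters from data.

```python
def river_like_posterior(p_functional_given_g, is_outlier,
                         p_outlier_if_functional=0.6,
                         p_outlier_if_not=0.05):
    """Combine a genomic-annotation prior with expression evidence via Bayes' rule.

    p_functional_given_g: prior probability of a functional variant from annotations only.
    is_outlier: whether the individual is an expression outlier for the nearby gene.
    The two likelihood parameters are hypothetical values, not estimates from the paper.
    """
    if is_outlier:
        like_functional = p_outlier_if_functional
        like_not = p_outlier_if_not
    else:
        like_functional = 1 - p_outlier_if_functional
        like_not = 1 - p_outlier_if_not
    numerator = p_functional_given_g * like_functional
    denominator = numerator + (1 - p_functional_given_g) * like_not
    return numerator / denominator


# Posterior over a small grid of annotation priors and outlier status.
for prior in (0.05, 0.5):
    for outlier in (False, True):
        score = river_like_posterior(prior, outlier)
        print(f"prior={prior:.2f} outlier={outlier} posterior={score:.3f}")
```

In this toy version the posterior is clearly highest when the prior is high and the individual is an outlier, and it stays modest when only one of the two lines of evidence is present, which is the pattern the shading in the figure is meant to convey.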