I'd like to do something in R that seems rather simple: draw a nice plot of a human gene, with exons and introns marked (maybe even using a different color for CDS vs. UTRs, but I can live without that for now), with the position of a bunch of mutations overlaid. I've tried a few packages for this, the best of which (so far, has been ggbio
. Here's some code for reproducibility:
library(Homo.sapiens)
library(ggbio)
library(tidyverse)
# first get gene structure information from Homo.sapiens annotation package
genes_all <- AnnotationDbi::keys(Homo.sapiens, keytype='GENEID')
generanges <- AnnotationDbi::select(Homo.sapiens,
keys=genes_all, keytype='GENEID',
columns=c('GENEID', 'SYMBOL', 'TXCHROM', 'TXSTART', 'TXEND'))
generanges <- drop_na(generanges) # need to get rid of these in order to indexm later
# convert to GRanges object
generanges <- makeGRangesFromDataFrame(generanges, keep.extra.columns=T)
# we'll use RSPO1 gene as an example
generanges[generanges$SYMBOL=='RSPO1']
# simulate 200 SNPs randomly scattered through RSPO1
snps <- sample(start(generanges[generanges$SYMBOL=='RSPO1']):end(generanges[generanges$SYMBOL=='RSPO1']),
200, replace=F)
# plot gene structures with autoplot, add SNP positions with geom_jitter
autoplot(Homo.sapiens, which=generanges[generanges$SYMBOL=='RSPO1'],
gap.geom='chevron', col='blue', fill='blue') +
geom_jitter(data=tibble(pos=snps), aes(x=pos, y=1), col='red', alpha=0.25) +
theme_bw()
The resulting plot looks like this:
This is not a bad start, but I would really like to show only a single transcript, not all four splice forms - e.g. the "canonical" Refseq transcript. This is just for ease of visualization/presentation. I think what I need to do is to subset the annotation object Homo.sapiens
, so that it contains only Refseq transcripts, and then pass that object to the autoplot
command. I'm not sure how to do this, though - or if there is a better way to achieve this goal with a different package. (Note that I tried dandelion and lollipop plots using the trackViewer
package, but the SNP density is way too high to visualize that way. Also, I really like the fact that ggbio
can be combined with traditional ggplot commands, as I've done here.)
You may want to consider getting
MANE select
transcripts from NCBI. Read about the project here.Thanks for this suggestion. Unfortunately, though, not all the genes I'm looking at are covered by MANE. In the end, I was able to kludge together a solution using
ggbio
for plotting and a combination ofbiovizBase
,Homo.sapiens
andTxDb.Hsapiens.UCSC.hg19.knownGene
to pull out and draw only the Refseq transcript for each gene, which I could then decorate withggplot
.