For what it's worth in R / Bioconductor load these packages
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
(use source("http://bioconductor.org/biocLite.R"); biocLite(c("VariantAnnotation", "TxDb.Hsapiens.UCSC.hg19.knownGene"))
the first time, to install these packages). Then create a convenient short-cut to the annotations, and a GRanges
representing regions of interest (this could be millions of elements long, or derived from a VCF file)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
roi <- GRanges("chr1", IRanges(12049235, width=1))
then locate the variants in the context of the annotation
loc <- refLocsToLocalLocs(roi, txdb)
The return value tells us that it hits two CDS, at the 10th nucleotide / 4th protein location of each.
> loc
GRanges with 2 ranges and 5 elementMetadata cols:
seqnames ranges strand | cdsLoc proteinLoc
<Rle> <IRanges> <Rle> | <IRanges> <CompressedIntegerList>
[1] chr1 [12049235, 12049235] * | [10, 10] 4
[2] chr1 [12049235, 12049235] * | [10, 10] 4
queryID txID cdsID
<integer> <character> <integer>
[1] 1 650 1992
[2] 1 651 1992
---
seqlengths:
chr1
NA
To find out the gene to which the transcript belongs,
> select(txdb, key=values(loc)$txID, cols="GENEID", keytype="TXID")
TXID GENEID
1 650 9927
2 651 9927
See ?refLocsToLocalLocs
, ?select
for some additional help. With
library(org.Hs.eg.db)
one could go on to discover
> select(org.Hs.eg.db, "9927", c("GENENAME", "SYMBOL"))
ENTREZID GENENAME SYMBOL
1 9927 mitofusin 2 MFN2
See the Annotation and Variants work flows.
up vote for AA position.