Question

Amino Acid Change To Genomic Location

11

13.0 years ago

Preethi ▴ 110

This is the reverse of what is asked in http://biostar.stackexchange.com/questions/6297/genomic-change-to-aa-change

I have a list of gene names and associated amino acid changes, and I would like to know the genomic coordinates of the SNPs that cause them. What is the best way to get these?

snp amino-acids • 32k views

ADD COMMENT • link updated 12 months ago by gernophil ▴ 120 • written 13.0 years ago by Preethi ▴ 110

Here is an example of what I was talking about: BRAF.p.V600E:c.1799T>A

Given just this information, is there a way I can get to the genomic coordinates? I guess there is the CDS position; can this direct us to the genomic location?
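To make the CDS-position idea concrete, here is a minimal sketch of the arithmetic: a 1-based CDS position tells you which codon it falls in and where inside that codon. The pattern and the name `parse_change` are illustrative assumptions for strings shaped exactly like the example above, not a general HGVS parser. Turning the CDS position into a chromosome coordinate still needs a transcript model.

```python
import re

# Matches strings shaped like "BRAF.p.V600E:c.1799T>A" (assumed format).
PATTERN = re.compile(
    r"^(?P<gene>[^.]+)\.p\.(?P<aa_ref>[A-Z*])(?P<aa_pos>\d+)(?P<aa_alt>[A-Z*])"
    r":c\.(?P<cds_pos>\d+)(?P<nt_ref>[ACGT])>(?P<nt_alt>[ACGT])$"
)


def parse_change(text):
    """Split the string into its parts and locate the CDS base within its codon."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised change: {text!r}")
    parts = match.groupdict()
    aa_pos = int(parts["aa_pos"])
    cds_pos = int(parts["cds_pos"])
    # CDS positions are 1-based: bases 1-3 form codon 1, bases 4-6 codon 2, ...
    codon_number = (cds_pos - 1) // 3 + 1
    offset_in_codon = (cds_pos - 1) % 3  # 0, 1 or 2
    parts.update(
        aa_pos=aa_pos,
        cds_pos=cds_pos,
        codon_number=codon_number,
        offset_in_codon=offset_in_codon,
        # The protein and CDS annotations should point at the same codon.
        consistent=(codon_number == aa_pos),
    )
    return parts


info = parse_change("BRAF.p.V600E:c.1799T>A")
print(info["codon_number"], info["offset_in_codon"], info["consistent"])
```

So c.1799 is the middle base of codon 600, which agrees with p.V600E.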

ADD REPLY • link 13.0 years ago by Preethi ▴ 110

Ram · Answer 1 · 2012-01-06

I wrote a tool named backlocate for this job.

~~This tool is available in my experimental package http://code.google.com/p/variationtoolkit/.~~

example:

echo -e  "NOTCH2\tM1T\nEIF4G1\tD240Y" |\
    backlocate -f /path/to/hg19.fa 

#User.Gene    AA1    petide.pos.1    AA2    knownGene.name    knownGene.strand    knownGene.AA    index0.in.rna    codon    base.in.rna    chromosome    index0.in.genomic    exon
##uc001eik.2
NOTCH2    M    1    T    uc001eik.2    -    M    0    ATG    A    chr1    120612019    Exon 1
NOTCH2    M    1    T    uc001eik.2    -    M    1    ATG    T    chr1    120612018    Exon 1
NOTCH2    M    1    T    uc001eik.2    -    M    2    ATG    G    chr1    120612017    Exon 1
##uc001eil.2
NOTCH2    M    1    T    uc001eil.2    -    M    0    ATG    A    chr1    120612019    Exon 1
NOTCH2    M    1    T    uc001eil.2    -    M    1    ATG    T    chr1    120612018    Exon 1
NOTCH2    M    1    T    uc001eil.2    -    M    2    ATG    G    chr1    120612017    Exon 1
##uc001eim.3
NOTCH2    M    1    T    uc001eim.3    -    M    0    ATG    A    chr1    120548116    Exon 2
NOTCH2    M    1    T    uc001eim.3    -    M    1    ATG    T    chr1    120548115    Exon 2
NOTCH2    M    1    T    uc001eim.3    -    M    2    ATG    G    chr1    120548114    Exon 2
##Warning ref aminod acid for uc003fnp.2  [240] is not the same (I/D)
EIF4G1    D    240    Y    uc003fnp.2    +    I    717    ATC    A    chr3    184039089    Exon 10
EIF4G1    D    240    Y    uc003fnp.2    +    I    718    ATC    T    chr3    184039090    Exon 10
EIF4G1    D    240    Y    uc003fnp.2    +    I    719    ATC    C    chr3    184039091    Exon 10
##Warning ref aminod acid for uc003fnu.3  [240] is not the same (I/D)
EIF4G1    D    240    Y    uc003fnu.3    +    I    717    ATC    A    chr3    184039089    Exon 9
EIF4G1    D    240    Y    uc003fnu.3    +    I    718    ATC    T    chr3    184039090    Exon 9
EIF4G1    D    240    Y    uc003fnu.3    +    I    719    ATC    C    chr3    184039091    Exon 9
##Warning ref aminod acid for uc003fnq.2  [240] is not the same (V/D)
EIF4G1    D    240    Y    uc003fnq.2    +    V    717    GTA    G    chr3    184039350    Exon 7
EIF4G1    D    240    Y    uc003fnq.2    +    V    718    GTA    T    chr3    184039351    Exon 7
EIF4G1    D    240    Y    uc003fnq.2    +    V    719    GTA    A    chr3    184039352    Exon 7
##Warning ref aminod acid for uc003fnr.2  [240] is not the same (L/D)
EIF4G1    D    240    Y    uc003fnr.2    +    L    717    CTC    C    chr3    184039581    Exon 6
EIF4G1    D    240    Y    uc003fnr.2    +    L    718    CTC    T    chr3    184039582    Exon 6
EIF4G1    D    240    Y    uc003fnr.2    +    L    719    CTC    C    chr3    184039583    Exon 6
##Warning ref aminod acid for uc003fny.3  [240] is not the same (T/D)
EIF4G1    D    240    Y    uc003fny.3    +    T    717    ACC    A    chr3    184039677    Exon 3
EIF4G1    D    240    Y    uc003fny.3    +    T    718    ACC    C    chr3    184039678    Exon 3
EIF4G1    D    240    Y    uc003fny.3    +    T    719    ACC    C    chr3    184039679    Exon 3
##uc010hxx.2
EIF4G1    D    240    Y    uc010hxx.2    +    D    717    GAT    G    chr3    184038780    Exon 10
EIF4G1    D    240    Y    uc010hxx.2    +    D    718    GAT    A    chr3    184039069    Exon 11
EIF4G1    D    240    Y    uc010hxx.2    +    D    719    GAT    T    chr3    184039070    Exon 11
##Warning ref aminod acid for uc003fns.2  [240] is not the same (L/D)
EIF4G1    D    240    Y    uc003fns.2    +    L    717    CTC    C    chr3    184039209    Exon 10
EIF4G1    D    240    Y    uc003fns.2    +    L    718    CTC    T    chr3    184039210    Exon 10
EIF4G1    D    240    Y    uc003fns.2    +    L    719    CTC    C    chr3    184039211    Exon 10
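Most of the rows above belong to transcripts whose reference amino acid does not match the query (the "Warning" lines). A minimal sketch for keeping only the matching transcripts could look like this. It assumes whitespace-separated columns in the order of the header line above; the function name and dictionary keys are my own, and this is not part of backlocate itself.

```python
def matching_rows(text):
    """Keep rows whose annotated reference amino acid equals the queried one."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, the header and warning lines
        # 13 columns; the last one ("Exon 10") contains a space itself.
        fields = line.split(None, 12)
        if len(fields) < 13:
            continue
        (gene, aa1, pep_pos, aa2, transcript, strand, ref_aa,
         rna_index0, codon, base, chrom, genomic_index0, exon) = fields
        if aa1 != ref_aa:
            continue  # the transcript does not encode the expected residue
        rows.append({
            "gene": gene,
            "transcript": transcript,
            "strand": strand,
            "codon": codon,
            "base": base,
            "chrom": chrom,
            "genomic_index0": int(genomic_index0),
            "exon": exon,
        })
    return rows


sample = """\
##Warning ref aminod acid for uc003fnp.2  [240] is not the same (I/D)
EIF4G1    D    240    Y    uc003fnp.2    +    I    717    ATC    A    chr3    184039089    Exon 10
##uc010hxx.2
EIF4G1    D    240    Y    uc010hxx.2    +    D    717    GAT    G    chr3    184038780    Exon 10
"""
for row in matching_rows(sample):
    print(row["transcript"], row["chrom"], row["genomic_index0"], row["exon"])
```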

Ram · Answer 2 · 2016-01-11

6

9.0 years ago

michael.d.mclellan ▴ 150

http://bioinformatics.mdanderson.org/main/Transvar

Introduction

TransVar is a reverse annotator for inferring genomic characterization(s) of mutations (e.g., chr3:178936091 G/A) from protein or cDNA annotation(s) (e.g., PIK3CA p.E545K or PIK3CA c.1633G>A). It is designed for resolving ambiguous mutation origins, arising from alternative splicing.

TransVar has the following features:

supports HGVS nomenclature
supports both left-alignment and right-alignment convention in reporting indels.
supports annotation of a region based on a transcript dependent characterization
supports single nucleotide variation (SNV), insertions and deletions (indels) and block substitutions
supports mutations at both coding region and intronic/UTR regions
supports transcript annotation from commonly-used databases such as Ensembl, NCBI RefSeq and GENCODE etc
supports UniProt protein id as transcript id
supports GRCh36, 37, 38
supports forward annotation as well.

Citation: Zhou W, Chen T, Chong Z, Rohrdanz MA, Melott JM, Wakefield C, Zeng J, Weinstein JN, Meric-Bernstam F, Mills GB, Chen K. TransVar: a multi-level variant annotator for precision genomics. Nature Methods. In Press.

ADD COMMENT • link updated 5.6 years ago by Ram 44k • written 9.0 years ago by michael.d.mclellan ▴ 150

This saved me a lot of time. Thanks!

ADD REPLY • link 7.4 years ago by stachele • 0

I just want to point out that the project has moved to GitHub.

ADD REPLY • link 6.5 years ago by Eli Korvigo ▴ 230

Is this still the best tool to do this? Or is there something like a reverse VEP?

ADD REPLY • link 12 months ago by gernophil ▴ 120

Ram · Answer 3 · 2013-02-06

5

11.9 years ago

Emily 24k

Have you tried the Ensembl REST API?

To convert from protein coordinates http://beta.rest.ensembl.org/documentation/info/assembly_translation

To convert from transcript coordinates

You'd need the Ensembl protein or transcript IDs, which you can get easily using BioMart (tutorial on BioMart here).

Then, for your query, BRAF.p.V600E:c.1799T>A, input

From protein

http://beta.rest.ensembl.org/map/translation/ENSP00000288602/600..600?content-type=application/json

Output

{"mappings":[{"seq_region_name":"7","gap":0,"coord_system":"chromosome","strand":-1,"rank":0,"end":140453135,"start":140453137}]}

From transcript

http://beta.rest.ensembl.org/map/cds/ENST00000288602/1799?content-type=application/json

{"mappings":[{"seq_region_name":"7","gap":0,"coord_system":"chromosome","strand":-1,"rank":0,"end":140453136,"start":140453136}]}
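For scripting these lookups, a minimal Python sketch follows. It assumes the service is now reached at https://rest.ensembl.org instead of the beta host used above, and that this host reports GRCh38 coordinates, with GRCh37 (which the coordinates in this answer appear to be) served from grch37.rest.ensembl.org. Please check both against the current Ensembl REST documentation. The helper names are my own.

```python
import json
import urllib.request

# Assumption: current server; swap in "https://grch37.rest.ensembl.org" for GRCh37.
SERVER = "https://rest.ensembl.org"


def map_url(kind, stable_id, start, end=None):
    """Build a map/translation (protein) or map/cds (coding sequence) URL."""
    if kind not in ("translation", "cds"):
        raise ValueError("kind must be 'translation' or 'cds'")
    end = start if end is None else end
    return (f"{SERVER}/map/{kind}/{stable_id}/{start}..{end}"
            "?content-type=application/json")


def genomic_spans(payload):
    """Turn the 'mappings' list into (chromosome, low, high, strand) tuples."""
    spans = []
    for mapping in payload["mappings"]:
        low, high = sorted((mapping["start"], mapping["end"]))
        spans.append((mapping["seq_region_name"], low, high, mapping["strand"]))
    return spans


def fetch(url):
    """Perform the HTTP request (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Offline demonstration using the JSON shown in this answer.
example = json.loads(
    '{"mappings":[{"seq_region_name":"7","gap":0,"coord_system":"chromosome",'
    '"strand":-1,"rank":0,"end":140453135,"start":140453137}]}'
)
print(map_url("translation", "ENSP00000288602", 600))
print(genomic_spans(example))
```

A live lookup would be `genomic_spans(fetch(map_url("cds", "ENST00000288602", 1799)))`. A codon that spans an exon boundary should come back as more than one mapping, which is why the helper returns a list.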

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 11.9 years ago by Emily 24k

I like the idea of being able to use Ensembl, but is there any way to get the nucleotide as well? My example variant was: VAR_031436 Q9NXK6 MPRG_HUMAN p.Ile24Thr
I need the nucleotide change and the chromosome coordinate for it. The link below gives me the chromosomal location, but not the corresponding nucleotide change (I am assuming there is just one possible combination of reference and alternate nucleotides, so there is no ambiguity): http://beta.rest.ensembl.org/map/translation/ENSP00000343877/24..24?content-type=application/json

Any thoughts?

ADD REPLY • link 11.3 years ago by Bioinfosm ▴ 620

You can get this using the Ensembl VEP. http://www.ensembl.org/info/docs/tools/vep/index.html.

Use the annotation you have there, selecting HGVS as your input file format. This will tell you the location you hit, the nucleotide/codon change, and the gene, transcript, and protein sequence it affects.
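If your changes are written with one-letter codes (such as V600E), they first need to be turned into HGVS-style protein notation. A minimal sketch, assuming an `ID:p.Xxx123Yyy` line is the form you want to feed in; the function name is hypothetical and only simple substitutions are handled:

```python
import re

# Standard one-letter to three-letter amino acid codes ("*" is a stop).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def to_hgvs_protein(seq_id, change):
    """Convert e.g. ('ENSP00000288602', 'V600E') to 'ENSP00000288602:p.Val600Glu'."""
    match = re.fullmatch(r"(?:p\.)?([A-Z*])(\d+)([A-Z*])", change)
    if match is None:
        raise ValueError(f"not a simple substitution: {change!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_LETTER or alt not in THREE_LETTER:
        raise ValueError(f"unknown amino acid code in {change!r}")
    return f"{seq_id}:p.{THREE_LETTER[ref]}{pos}{THREE_LETTER[alt]}"


print(to_hgvs_protein("ENSP00000288602", "V600E"))
print(to_hgvs_protein("ENSP00000343877", "p.I24T"))
```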

ADD REPLY • link 11.3 years ago by Emily 24k

Emily, thanks for pointing that out! I was able to format my data as HGVS and use VEP to obtain the coordinate and codon. However, some entries do not get any output from VEP: ENSP00000256339:p.Val1597Ala returns a blank, though it works just fine in PolyPhen-2! http://genetics.bwh.harvard.edu/ggi/pph2/3a7d3e9d7dc0c940e1b608c62ba2c28296dcfe2b/1771738.html

ADD REPLY • link 11.3 years ago by Bioinfosm ▴ 620

Hi. I think the issue here is that we don't have a Valine annotated at position 1597 in ENSP00000256339. We have that position as a Proline (http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000133958;r=14:93799565-94173618;t=ENST00000256339). I put in the input ENSP00000256339:p.Pro1597Ala instead and got four hits. Perhaps this is the wrong protein ID?

ADD REPLY • link 11.3 years ago by Emily 24k

The UniProt ID Q9P2D8 supposedly maps to this ENSP, which seems incorrect! In fact, it should have been ENSP00000376858, which works fine with VEP as well. I am checking with the UniProt team on that.

Thanks once again!

ADD REPLY • link 11.3 years ago by Bioinfosm ▴ 620

"To convert from transcript coordinates you'd need to get the Ensembl protein or transcript IDs, which you can get easily using BioMart (tutorial on BioMart here"

@Emily_Ensembl I think you forgot to include the link to the BioMart tutorial on how to go from genomic coordinate to transcript ID. Can you post it here? Thanks.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.4 years ago by Tommy Carstensen ▴ 210

score 0 · Answer 4 · 2012-01-06

13.0 years ago

Zev.Kronenberg 12k

You can find the 3-bp codon that encodes the amino acid, but you need more information if you want single-base-pair resolution...
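To illustrate the point, here is a minimal sketch that lists which single-base changes in a reference codon would give the new amino acid, using the standard genetic code. The codon is read on the transcript strand, so for a minus-strand gene the genomic alleles are the complements. The function name is my own. Sometimes there is exactly one candidate, sometimes several, and a change that needs two bases gives none.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_base_candidates(ref_codon, ref_aa, alt_aa):
    """Return (offset, ref_base, alt_base, new_codon) for each single-base
    substitution in ref_codon that yields alt_aa."""
    ref_codon = ref_codon.upper()
    if CODON_TABLE[ref_codon] != ref_aa:
        raise ValueError(f"{ref_codon} encodes {CODON_TABLE[ref_codon]}, not {ref_aa}")
    hits = []
    for offset, old in enumerate(ref_codon):
        for new in "ACGT":
            if new == old:
                continue
            mutated = ref_codon[:offset] + new + ref_codon[offset + 1:]
            if CODON_TABLE[mutated] == alt_aa:
                hits.append((offset, old, new, mutated))
    return hits


# GTG (Val) to Glu: only one single-base route.
print(single_base_candidates("GTG", "V", "E"))
# AGT (Ser) to Arg: three routes, so the protein change alone is ambiguous.
print(single_base_candidates("AGT", "S", "R"))
```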

ADD COMMENT • link 13.0 years ago by Zev.Kronenberg 12k

score 0 · Answer 5 · 2012-12-14

12.1 years ago

nonish5 ▴ 40

Can I use it if I don't know the amino acid position of the protein? In addition, as far as I understand, gene name and mutation info (either of the form c.123G>T or IVS4+1G>T) are not enough in order to deduce a specific genomic location as there may be more than a single transcript. Am I wrong?

ADD COMMENT • link 12.1 years ago by nonish5 ▴ 40

"If I don't know the amino acid position of the protein?": you could loop over all the positions of the protein. "are not enough in order to deduce a specific genomic location as there may be more than a single transcript": of course. Furthermore, the very same protein can be encoded by two mRNAs.

ADD REPLY • link 12.1 years ago by Pierre Lindenbaum 164k

Ram · Answer 6 · 2013-02-06

Albeit built for mapping protein sequence intervals, my script "protein2genome.pl" can do this. It ships with the variant annotation tool CooVar.

Here is how it works. First you need a GFF or a GTF file with the coordinates of your genes. To test your specific example, I created a GTF file that contains only the first isoform of the BRAF gene, but you could also work with GFF/GTF files containing all human genes and isoforms or containing genes from any other organism. Then you can run protein2genome.pl like this:

echo "ENST00000288602 . . 600 600 . . . ID=V600E" | perl protein2genome.pl BRAF-001.gtf

Produces the output:

7    .    .    140453135    140453137    .    -    .    ID=V600E(ENST00000288602);segment=1of1;p_start=600;p_end=600

Explanation: I am basically piping a GFF-compliant input line specifying the ID of the transcript and the position of the protein sequence change into the script. The script then outputs the mapped genomic coordinates (chromosome 7, codon start=140453135, codon-end=140453137).
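For readers who want to see the underlying idea, here is a minimal Python sketch of mapping a protein position to genomic coordinates from a list of CDS segments. It is not the code of protein2genome.pl; the function name and the toy coordinates are hypothetical, and the segments are assumed to be 1-based inclusive intervals of one transcript, as in the CDS lines of a GTF file.

```python
def codon_genomic_positions(cds_segments, strand, aa_pos):
    """Return the three genomic positions of codon aa_pos (1-based),
    in transcript order, for a transcript given by its CDS segments."""
    if strand == "+":
        ordered = sorted(cds_segments)
        coding = [p for start, end in ordered for p in range(start, end + 1)]
    elif strand == "-":
        # On the minus strand the CDS runs from high to low coordinates.
        ordered = sorted(cds_segments, reverse=True)
        coding = [p for start, end in ordered for p in range(end, start - 1, -1)]
    else:
        raise ValueError("strand must be '+' or '-'")
    first = (aa_pos - 1) * 3  # 0-based index of the codon's first base
    if aa_pos < 1 or first + 3 > len(coding):
        raise ValueError("amino acid position is outside the coding sequence")
    return coding[first:first + 3]


# Toy transcript with two CDS segments (hypothetical coordinates).
segments = [(100, 104), (200, 203)]
print(codon_genomic_positions(segments, "+", 2))
print(codon_genomic_positions(segments, "-", 2))
```

In both toy calls the second codon is split across the two segments, so its genomic positions are not contiguous. Expanding every base into a list is wasteful for very long coding sequences, but it keeps the sketch short.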

A more detailed explanation of input and output formats can be found here.

I would say this script is most useful for non-model organisms, because for model organisms with associated databases there are other ways to do this (see the other answers in this thread, for example the Ensembl REST API, which is quite neat).